- Subject: Re: Should Lua be more strict about Unicode errors?
- From: Ricardo Ramos Massaro <ricardo.massaro@...>
- Date: Thu, 3 Sep 2015 04:21:56 -0300
On Wed, Sep 2, 2015 at 4:24 PM, Dirk Laurie <dirk.laurie@gmail.com> wrote:
> Actually, I have only just for the first time ever read all of the
> Wikipedia page. At the bottom, it says:
>
> WTF-8 (Wobbly Transformation Format − 8-bit) is UTF-8 where the
> encodings of the surrogate halves (U+D800 through U+DFFF) are allowed.
> This is necessary to store possibly-invalid UTF-16, such as Windows
> filenames. The term seems to have come from the Rust programming
> language.[31] Many systems that deal with UTF-8 work this way without
> considering it a different encoding, as it is simpler. The source code
> samples above work this way, for instance.
Note that Wikipedia is misleading when it says "Many systems that deal
with UTF-8 work this way without considering it a different encoding,
as it is simpler."
WTF-8 dictates that you take special care when concatenating strings:
if the first string ends with a leading surrogate half and the second
string starts with a trailing surrogate half, you have to merge the
two surrogate halves into a single code point encoded in valid UTF-8.
A plain byte-wise concatenation would leave the pair as two 3-byte
sequences, which is not well-formed WTF-8.
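To make the rule concrete, here is a minimal sketch in Python (not Lua,
and not taken from any existing WTF-8 implementation). The names
decode_surrogate and wtf8_concat are made up for this example, and it
assumes both inputs are already well-formed WTF-8 byte strings.

```python
def decode_surrogate(b):
    """Return the surrogate code point encoded by the 3 bytes b, or None."""
    # Surrogates U+D800..U+DFFF encode as ED A0..BF 80..BF.
    if len(b) != 3 or b[0] != 0xED:
        return None
    if not (0xA0 <= b[1] <= 0xBF and 0x80 <= b[2] <= 0xBF):
        return None
    return 0xD000 | ((b[1] & 0x3F) << 6) | (b[2] & 0x3F)


def wtf8_concat(a, b):
    """Concatenate two WTF-8 byte strings, merging a split surrogate pair."""
    hi = decode_surrogate(a[-3:])
    lo = decode_surrogate(b[:3])
    is_lead = hi is not None and 0xD800 <= hi <= 0xDBFF
    is_trail = lo is not None and 0xDC00 <= lo <= 0xDFFF
    if not (is_lead and is_trail):
        return a + b  # nothing to merge; plain concatenation is fine

    # Combine the halves into one supplementary code point ...
    cp = 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)
    # ... and re-encode it as an ordinary 4-byte UTF-8 sequence.
    merged = bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])
    return a[:-3] + merged + b[3:]
```

For example, joining b"a\xed\xa0\xbd" (ends with U+D83D) and
b"\xed\xb8\x80b" (starts with U+DE00) should give the 4-byte encoding
of U+1F600 between "a" and "b", not six bytes of surrogate halves.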
This is a minor point, but it's important to note that Lua can't claim
to support WTF-8 in its current state (nor am I suggesting it should).
-Ricardo