- Subject: Re: Should Lua be more strict about Unicode errors?
- From: Coda Highland <chighland@...>
- Date: Thu, 3 Sep 2015 08:48:55 -0700
On Thu, Sep 3, 2015 at 12:21 AM, Ricardo Ramos Massaro
<ricardo.massaro@gmail.com> wrote:
> On Wed, Sep 2, 2015 at 4:24 PM, Dirk Laurie <dirk.laurie@gmail.com> wrote:
>> Actually, I have only just for the first time ever read all of the
>> Wikipedia page. At the bottom, it says:
>>
>> WTF-8 (Wobbly Transformation Format − 8-bit) is UTF-8 where the
>> encodings of the surrogate halves (U+D800 through U+DFFF) are allowed.
>> This is necessary to store possibly-invalid UTF-16, such as Windows
>> filenames. The term seems to have come from the Rust programming
>> language.[31] Many systems that deal with UTF-8 work this way without
>> considering it a different encoding, as it is simpler. The source code
>> samples above work this way, for instance.
>
> Note that Wikipedia is misleading when it says "Many systems that deal
> with UTF-8 work this way without considering it a different encoding,
> as it is simpler."
>
> WTF-8 dictates that you take special care when concatenating strings:
> if the first string ends with a leading surrogate half and the second
> string starts with a trailing surrogate half, you have to merge the
> two surrogate halves into a single code point encoded in valid UTF-8.
>
> This is a minor point, but it's important to note that Lua can't claim
> to support WTF-8 in its current state (nor am I suggesting it should).
>
> -Ricardo
>
For what it's worth, that doesn't have to be done on concatenation.
Rust might do it that way, but conceptually there's room to support
naive UTF-8 encoding and provide a separate function that does a
normalization pass over the string (rewriting nonstandard encodings,
merging surrogates) and throws errors at that point.
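
To make that concrete, here is a rough sketch of such a pass. It is
written in Python for brevity rather than Lua, and the names
decode_lenient and normalize are made up for illustration; nothing like
this exists in Lua's utf8 library. It assumes "nonstandard encodings"
means overlong forms and encoded surrogate halves, and that an unpaired
surrogate is the error case.

```python
def decode_lenient(data: bytes) -> list:
    """Decode UTF-8-like bytes into code points, tolerating overlong
    forms and encoded surrogate halves (hypothetical helper)."""
    cps = []
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b < 0x80:
            cp, extra = b, 0
        elif 0xC0 <= b < 0xE0:
            cp, extra = b & 0x1F, 1
        elif 0xE0 <= b < 0xF0:
            cp, extra = b & 0x0F, 2
        elif 0xF0 <= b < 0xF8:
            cp, extra = b & 0x07, 3
        else:
            raise ValueError(f"invalid lead byte at offset {i}")
        if i + extra >= n:
            raise ValueError(f"truncated sequence at offset {i}")
        for j in range(1, extra + 1):
            c = data[i + j]
            if c & 0xC0 != 0x80:
                raise ValueError(f"invalid continuation byte at offset {i + j}")
            cp = (cp << 6) | (c & 0x3F)
        if cp > 0x10FFFF:
            raise ValueError(f"code point out of range at offset {i}")
        cps.append(cp)
        i += extra + 1
    return cps


def normalize(data: bytes) -> bytes:
    """Re-encode as strict UTF-8: overlong forms become shortest-form,
    adjacent surrogate halves are merged, unpaired halves raise."""
    cps = decode_lenient(data)
    out = []
    i = 0
    while i < len(cps):
        cp = cps[i]
        if 0xD800 <= cp <= 0xDBFF:
            # Leading half: must be followed by a trailing half.
            if i + 1 < len(cps) and 0xDC00 <= cps[i + 1] <= 0xDFFF:
                cp = 0x10000 + ((cp - 0xD800) << 10) + (cps[i + 1] - 0xDC00)
                i += 1
            else:
                raise ValueError("unpaired leading surrogate")
        elif 0xDC00 <= cp <= 0xDFFF:
            raise ValueError("unpaired trailing surrogate")
        out.append(chr(cp))
        i += 1
    return "".join(out).encode("utf-8")


# Naive concatenation first, one normalization pass afterwards.
lead = b"\xed\xa0\xbd"    # U+D83D encoded as three bytes
trail = b"\xed\xb8\x80"   # U+DE00 encoded as three bytes
merged = normalize(lead + trail)   # four-byte encoding of U+1F600
```

With this split, concatenation stays a plain byte append, and the cost
of validation is paid only where strict UTF-8 is actually required.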
/s/ Adam