On Wed, Sep 9, 2015 at 2:50 AM, Dirk Laurie <dirk.laurie@gmail.com> wrote:
> 2015-09-08 22:51 GMT+02:00 Ross Berteig <Ross@cheshireeng.com>:
>
>> UTF-8 is at least normalizable in a way that would stabilize and
>> be immune to further normalization.
>
> I think the intention of the disclaimer "Any operation that needs
> the meaning of a character, such as character classification,
> is outside its scope. " is that the utf8 library does not claim to
> provide the full Monty. This discussion has amply proved that
> it is a nontrivial task to provide such a library.
>
> In the documentation of the utf8 library there are provisos like
> "assuming that the subject is a valid UTF-8 string". The scope
> of the manual does not include spelling out what happens
> when something is out of spec. For example, it is nowhere stated
> what #tbl returns when the table is not a sequence.
>
> I'm happy that the manual says enough to warn people that the
> utf8 library is not an implementation of a standard.
>
> ~~~
> A logician, a mathematician and a salesman visited Namibia
> for the first time. From the window of their bus, a karakul
> sheep could be seen.
>
> "Amazing", said the salesman. "The sheep in Namibia are black".
>
> "No", corrected the mathematician. "At least one sheep in
> Namibia is black."
>
> The logician pursed his lips and slowly brought the forefinger
> and thumb of his right hand together. "There is at least one
> sheep in Namibia, and the side of it that we can see is black."
> ~~~
>

The normalization I am referring to would be in scope for the limited
subset that the utf8 library supports: decode every code point in the
string to a fixed-width encoding (UCS-4), collapsing paired surrogates
in the process, and then re-encode the result as UTF-8. This process
operates only on the byte-level representation of the string, not on
the semantic meaning of any code point in it. The one exception is
surrogate pairs, which can be identified by a straightforward range
check.
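
To make the range check concrete, here is a minimal sketch of the
surrogate-collapsing step, working on an array of already-decoded
code points (the "UCS-4" stage). The function names are mine and
purely illustrative; nothing here is part of the utf8 library, and
leaving an unpaired surrogate untouched is an assumption on my part.

```lua
-- Hypothetical helpers; the names are illustrative, not a library API.
local function is_high_surrogate(cp) return cp >= 0xD800 and cp <= 0xDBFF end
local function is_low_surrogate(cp)  return cp >= 0xDC00 and cp <= 0xDFFF end

-- Combine a high/low surrogate pair into the supplementary code point
-- it stands for (each surrogate carries 10 bits, offset by 0x10000).
local function combine_surrogates(hi, lo)
  return 0x10000 + (hi - 0xD800) * 0x400 + (lo - 0xDC00)
end

-- Collapse every high+low pair in an array of code points.
-- Assumption: an unpaired surrogate is passed through unchanged.
local function collapse_pairs(cps)
  local out, i = {}, 1
  while i <= #cps do
    local cp, nxt = cps[i], cps[i + 1]
    if is_high_surrogate(cp) and nxt and is_low_surrogate(nxt) then
      out[#out + 1] = combine_surrogates(cp, nxt)
      i = i + 2
    else
      out[#out + 1] = cp
      i = i + 1
    end
  end
  return out
end
```

For example, collapse_pairs({0x41, 0xD83D, 0xDE00, 0x42}) gives
{0x41, 0x1F600, 0x42}. No character properties are consulted, only
the two numeric ranges.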

Lua strings are length-tagged rather than null-terminated, and the
input should always be consumed one byte at a time. That is, don't
assume that a code point's initial bits accurately indicate its
length; consume continuation bytes until you reach a non-continuation
byte or the end of the string. Under those two conditions it is not
possible to construct a string that makes such a normalization pass
crash or run indefinitely, unless that string would do so anyway
(i.e. if you could crash Lua without needing utf8).
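
A rough sketch of the decode/re-encode pass, to show why it cannot
run away: each loop iteration consumes at least one byte, and the
scan is bounded by #s. All names are hypothetical, and the policy for
malformed input (replace with U+FFFD) is my assumption, not something
the manual specifies. Surrogate collapsing would slot in between
decode and encode and is left out here.

```lua
local REPLACEMENT = 0xFFFD  -- assumption: malformed sequences become U+FFFD

local function is_continuation(b) return b >= 0x80 and b <= 0xBF end

-- For a non-ASCII lead byte: the continuation count it claims, and its
-- payload bits. Returns nil for bytes that cannot start a sequence.
local function lead_info(b)
  if b >= 0xC0 and b <= 0xDF then return 1, b - 0xC0 end
  if b >= 0xE0 and b <= 0xEF then return 2, b - 0xE0 end
  if b >= 0xF0 and b <= 0xF7 then return 3, b - 0xF0 end
  return nil
end

-- Decode to an array of code points without trusting the lead byte.
local function decode(s)
  local cps, i, n = {}, 1, #s
  while i <= n do
    local lead = s:byte(i)
    i = i + 1                     -- always advance: guarantees termination
    local cp
    if lead < 0x80 then
      cp = lead
    else
      local first = i
      -- Consume continuation bytes until a non-continuation byte or the end.
      while i <= n and is_continuation(s:byte(i)) do i = i + 1 end
      local expected, payload = lead_info(lead)
      if expected == i - first then
        cp = payload
        for j = first, i - 1 do cp = cp * 64 + (s:byte(j) - 0x80) end
        if cp > 0x10FFFF then cp = REPLACEMENT end
      else
        cp = REPLACEMENT          -- claimed length disagrees with the bytes
      end
    end
    cps[#cps + 1] = cp
  end
  return cps
end

-- Encode one code point (0..0x10FFFF) in its shortest UTF-8 form.
local function encode_one(cp)
  if cp < 0x80 then
    return string.char(cp)
  elseif cp < 0x800 then
    return string.char(0xC0 + math.floor(cp / 64), 0x80 + cp % 64)
  elseif cp < 0x10000 then
    return string.char(0xE0 + math.floor(cp / 4096),
                       0x80 + math.floor(cp / 64) % 64,
                       0x80 + cp % 64)
  else
    return string.char(0xF0 + math.floor(cp / 262144),
                       0x80 + math.floor(cp / 4096) % 64,
                       0x80 + math.floor(cp / 64) % 64,
                       0x80 + cp % 64)
  end
end

local function normalize(s)
  local cps, parts = decode(s), {}
  for k = 1, #cps do parts[k] = encode_one(cps[k]) end
  return table.concat(parts)
end
```

With this sketch, valid input round-trips unchanged, the overlong
sequence "\192\128" comes back as the single byte "\0", and a
truncated sequence such as "\226\130" comes back as U+FFFD. Running
normalize a second time changes nothing further.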

/s/ Adam