- Subject: Re: unicode support in lua
- From: David Kastrup <dak@...>
- Date: Thu, 26 Apr 2007 13:09:12 +0200
Klaus Ripke <paul-lua@malete.org> writes:
> On Thu, Apr 26, 2007 at 11:00:46AM +0200, David Kastrup wrote:
>> And so on. slnunicode does not actually do much in the area of
>> verification.
> The statement is:
> "
> According to http://ietf.org/rfc/rfc3629.txt we support up to 4-byte
> (21 bit) sequences encoding the UTF-16 reachable 0-0x10FFFF.
> Any byte not part of a 2-4 byte sequence in that range decodes to itself.
> Ill formed (non-shortest) "C0 80" will be decoded as two code points C0 and 80,
> not code point 0; see security considerations in the RFC.
> However, UTF-16 surrogates (D800-DFFF) are accepted.
> "
>
> Decode-encode always gives valid UTF-8.
But it is not an unambiguous representation of the input. Personally,
I favor the strategy "any byte not part of a legal minimal 1-4 byte
sequence decodes to 0x1100xx, and values 0x1100xx encode as single
bytes xx again". Note that the utf-8 coding algorithm easily supports
values in that range, so one can still do string manipulation as
usual. It is also easy to weed out or flag illegal bytes. It also
means that one has different procedures for encoding into internally
used utf-8 (always valid, except that it may contain patterns for
0x1100xx, and basically a packed array representation) and into
external utf-8 (where arbitrary bytes may be produced by re-encoding).
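A minimal sketch of that decoder in Lua (5.3+, for the integer bit
operators; the function name and details are mine, not slnunicode's):

-- Every byte that is not part of a legal minimal 1-4 byte sequence
-- decodes to 0x110000 + its value.  (Surrogate rejection omitted.)
local RAW = 0x110000

local function decode_lossless(s)
  local cps, i = {}, 1
  while i <= #s do
    local b = s:byte(i)
    local len, cp, min
    if b <= 0x7F then len, cp, min = 1, b, 0
    elseif b >= 0xC2 and b <= 0xDF then len, cp, min = 2, b & 0x1F, 0x80
    elseif b >= 0xE0 and b <= 0xEF then len, cp, min = 3, b & 0x0F, 0x800
    elseif b >= 0xF0 and b <= 0xF4 then len, cp, min = 4, b & 0x07, 0x10000
    end
    local ok = len ~= nil
    if ok then
      for k = 1, len - 1 do              -- collect continuation bytes
        local c = s:byte(i + k)
        if not c or c < 0x80 or c > 0xBF then ok = false; break end
        cp = (cp << 6) | (c & 0x3F)
      end
    end
    -- weed out non-shortest forms and values beyond 0x10FFFF
    if ok and (cp < min or cp > 0x10FFFF) then ok = false end
    if ok then
      cps[#cps + 1], i = cp, i + len
    else
      cps[#cps + 1], i = RAW + b, i + 1  -- illegal byte -> 0x1100xx
    end
  end
  return cps
end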
The disadvantage is that each illegal byte needs 4 bytes for its
representation: decoding garbage might blow up the byte count by a
factor of at most four.
The advantage is that processing can rely on characteristics of the
patterns.
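The matching external encoder is then trivial: the reserved range
turns back into raw bytes, everything else gets shortest-form UTF-8
(again a sketch under the same assumptions):

local function encode_external(cps)
  local out = {}
  for _, cp in ipairs(cps) do
    if cp >= RAW and cp <= RAW + 0xFF then
      out[#out + 1] = string.char(cp - RAW)   -- raw byte passes through
    elseif cp <= 0x7F then
      out[#out + 1] = string.char(cp)
    elseif cp <= 0x7FF then
      out[#out + 1] = string.char(0xC0 | (cp >> 6), 0x80 | (cp & 0x3F))
    elseif cp <= 0xFFFF then
      out[#out + 1] = string.char(0xE0 | (cp >> 12),
                                  0x80 | ((cp >> 6) & 0x3F),
                                  0x80 | (cp & 0x3F))
    else
      out[#out + 1] = string.char(0xF0 | (cp >> 18),
                                  0x80 | ((cp >> 12) & 0x3F),
                                  0x80 | ((cp >> 6) & 0x3F),
                                  0x80 | (cp & 0x3F))
    end
  end
  return table.concat(out)
end

-- round trip: arbitrary bytes survive decode/encode unchanged
assert(encode_external(decode_lossless("ok \192\128 bad")) ==
       "ok \192\128 bad")

Encoding *internal* utf-8 differs only in the first branch, which
would emit the ordinary 4-byte pattern for 0x1100xx instead.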
>> It is not all too clear in my opinion how one could create a small
>> footprint Lua that supported byte arrays (if you want to, unibyte
>> strings) and multi-byte character strings where the characters
>> actually formed atomic string components.
> slnunicode supports both modes.
> The footprint is mostly about 12K for the unicode character table.
Please note that slnunicode does not really provide strings whose
atomic elements are Unicode characters: string indices and similar
things are always byte-based.
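For illustration, this is the byte-based model of the stock string
library (standard Lua functions; the example string is mine):

local s = "\195\132bc"   -- "Äbc": "Ä" is the two bytes C3 84 in UTF-8
print(#s)                -- 4: the byte count, not the 3 characters
print(s:sub(1, 1))       -- the lone byte C3, half of "Ä"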
This is more or less what the first iteration of multibyte support in
Emacs 20 was like. People hated it.
>> In short: proper utf-8 support comes at a price, and even large
>> closely related projects don't arrive at the same solutions.
> well, the UTF-8 encoding is not the hard part.
> slnunicode is lacking a lot of unicode features like special casing,
> canonical de/composition and collations.
I am more worried about the indexing and atomicity of string
characters. For the programmer, no model except a packed array of
Unicode characters makes sense.
As soon as you have to continuously worry about byte counts instead of
character counts, the complexity of the programming model explodes.
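For example, with byte-based strings even a plain character count
already takes a detour (the usual continuation-byte counting trick;
charcount is my name for it):

local function charcount(s)
  -- for valid UTF-8: characters = bytes that are not continuation
  -- bytes (0x80-0xBF)
  local _, n = s:gsub("[^\128-\191]", "")
  return n
end

local s = "na\195\175ve"    -- "naïve"
print(#s, charcount(s))     -- 6  5: byte count vs character count

Every offset coming out of find or sub then needs a similar
translation.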
--
David Kastrup