On 22-May-05, at 11:55 AM, Asko Kauppi wrote:
> I've been thinking about UTF-8 and Lua lately, and wonder how much
> work it would be to actually support that in Lua "out of the
> box". There are some programming languages (such as Tcl) that
> already claim to do that, and I feel the concept would match Lua's
> targets and philosophy rather nicely.
I guess that depends on what you mean by "support". Lua currently
does not interfere with UTF-8, but it lacks:
1) UTF-8 validation (a minimal check is sketched below)
2) A mechanism for declaring the encoding of a source file
3) An escape mechanism for including Unicode characters in string
literals (at present that is only possible by using a UTF-8 aware
text editor, or by working out the decimal \ escapes for each byte
of a character by hand; the sketch below includes the kind of
helper such an escape would need)
4) Multibyte-aware string patterns with Unicode character classes
5) Various utilities, including single code point conversion, code
point counting, normalization, and so on
Various people have attempted to implement some or all of these
features; standard libraries exist for them, but they are "bulky".
> I understand UTF-8 might not be everyone's favourite, but it is
> mine. :) And having a working framework (str:bytes(),
> str:length(), str:width()) could easily be adapted to other
> extended encoding schemes as well.
There are arguments and counter-arguments for all of the standard
Unicode Transformation Formats. UTF-8 is fairly easy to work with
if the majority of the work is simply moving strings between
components; it is less ideal for text processing, for which UTF-16
is generally better. (There are arguments and counter-arguments
about using a 32-bit internal representation as well; the 16-bit
representation is still variable-width because of surrogate pairs,
but since graphemes are often represented as multiple character
codes, display-oriented text processing is going to have to deal
with variable-length graphemes regardless of the base encoding.)
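
To illustrate the variable-width point, and the kind of
str:length() utility quoted above, here is a sketch in plain Lua
(utf8_len is an invented name). Continuation bytes in UTF-8 are
0x80-0xBF, so counting the bytes that are *not* continuations
yields the number of code points:

    local function utf8_len(s)
      local _, n = s:gsub("[^\128-\191]", "")
      return n
    end

    local s = "na\195\175ve"  -- "naïve": U+00EF is two octets
    print(s:len())            --> 6  (octets)
    print(utf8_len(s))        --> 5  (code points)

Note that this counts code points, not graphemes: "e" followed by
a combining accent still counts as two.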
> The reason I'm bringing this up right now is that the issue would
> fit nicely with the 5.1 "every type has a metatable" way of
> thinking; would it warrant an opportunity to have a closer look
> at what Lua means by 'strings' (or rather, their encoding) anyhow?
I'm pretty firmly of the belief that keeping strings as
octet-sequences is a genuine simplification. It is not uncommon to
have a mixture of character encodings in a single application, so
assigning a metatable to the string type will often prove
unsatisfactory. I'm not really sure what the solution is, but I
have been bitten more than once by programming languages such as
Perl and Python which have glued character encoding onto their
basic string types. (In Python, for example, a UTF-8 encoded byte
sequence is *not* of type unicode, which can be seriously awkward.)
If strings are simply octet-sequences, it becomes the programmer's
responsibility to identify (or remember) the encoding of each
string; that can also be awkward, but it has the advantage of
being clear.
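
For what it's worth, one way to make that responsibility explicit
without touching the shared string metatable is to wrap the octets
in a small table that carries its encoding. A minimal sketch; the
names (Text, bytes, enc) are purely hypothetical:

    local Text = {}
    Text.__index = Text   -- room for methods such as a utf8-aware len
    Text.__tostring = function (t) return t.bytes end

    function Text.new(bytes, enc)
      return setmetatable({ bytes = bytes, enc = enc or "utf-8" }, Text)
    end

    local greeting = Text.new("\195\161hoj", "utf-8")  -- "áhoj"
    print(greeting.enc)         --> utf-8
    print(tostring(greeting))   --> áhoj

Each string then carries its own encoding, so a mixture of
encodings in one application stays unambiguous.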
For the record, there are some hidden subtleties, particularly in
the area of normalization. Unicode does not mandate a single
canonical normalization, but the intent is clearly that the two
non-compatibility forms (NFC and NFD) define canonical equality
comparison. Unfortunately, honouring that would have a significant
impact on the use of Unicode strings as table keys (a problem
which is, indeed, visible in both Perl and Python). UTF-8 at least
has the virtue that any string containing only codes 0-127
(decimal) is identical in UTF-8 and ISO-8859-x, and furthermore
that every normalization form is the identity function on such
strings.
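
The table-key problem is easy to demonstrate: the NFC and NFD
forms of the same text are different octet-sequences, hence
different keys, even though they render identically:

    local nfc = "\195\169"   -- U+00E9, precomposed "é" (NFC)
    local nfd = "e\204\129"  -- U+0065 plus combining acute U+0301 (NFD)
    local t = {}
    t[nfc] = "precomposed"
    t[nfd] = "decomposed"
    print(nfc == nfd)        --> false: both entries coexist in t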