On 22-May-05, at 11:55 AM, Asko Kauppi wrote:

> I've been thinking about UTF-8 and Lua lately, and wonder how much
> work it would be to actually support that in Lua "out of the box".
> There are some programming languages (such as Tcl) that already claim
> to do that, and I feel the concept would match Lua's targets and
> philosophy rather nicely.

I guess that depends on what you mean by "support". Lua currently does 
not interfere with UTF-8, but it lacks:
1) UTF-8 validation
2) A mechanism for declaring the encoding of a source file
3) An escape mechanism for including UTF-8 in string literals (so that it is only possible by either using a UTF-8 aware text editor, or manually working out the decimal \ escapes for each byte of a particular character)
4) Multibyte-aware string patterns with Unicode character classes
5) Various utilities, including single character-code conversion, code counting, normalization, etc.
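To make items 1 and 3 concrete, here is a minimal sketch in Python (used only for illustration, since it ships a strict UTF-8 decoder). The helper names `is_valid_utf8` and `lua_decimal_escapes` are hypothetical. The second one works out by machine the decimal \ escapes that would otherwise have to be computed by hand for a Lua string literal.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data is a well-formed UTF-8 octet sequence."""
    try:
        data.decode("utf-8")  # strict decoding rejects malformed input
        return True
    except UnicodeDecodeError:
        return False


def lua_decimal_escapes(text: str) -> str:
    """Spell out non-ASCII characters as decimal byte escapes.

    ASCII bytes are copied through unchanged; this sketch makes no
    attempt to quote backslashes or quote characters.
    """
    out = []
    for byte in text.encode("utf-8"):
        if byte < 128:
            out.append(chr(byte))
        else:
            out.append("\\%03d" % byte)  # always three digits
    return "".join(out)


print(is_valid_utf8(b"\xc3\xa9"))        # True
print(is_valid_utf8(b"\xc3"))            # False (truncated sequence)
print(lua_decimal_escapes("caf\u00e9"))  # caf\195\169
```

The three-digit form is used so that an escape cannot run into a following digit character.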
Various people have made attempts to implement some or all of
these features; standard libraries exist for them (but they are
"bulky").

> I understand UTF-8 might not be everyone's favourite, but it is mine.
> :) And having a working framework (str:bytes(), str:length(),
> str:width()) could easily be adapted to other extended encoding
> schemes as well.

There are arguments and counter-arguments for all of the standard
Unicode Transformation Formats. UTF-8 is fairly easy to work with if
the majority of the work is simply moving strings between components;
it is less ideal for text processing, for which UTF-16 is generally
better. (There are also arguments and counter-arguments about using a
32-bit internal representation. The 16-bit representation is still
variable width because of surrogate pairs, but since graphemes are
often represented as multiple character codes, display-oriented text
processing has to be able to deal with variable-length grapheme codes
regardless of the base encoding.)
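A small sketch of the two kinds of variable width mentioned here, again in Python purely for illustration (the helper names `utf16_units` and `utf8_bytes` are hypothetical). The first sample needs a surrogate pair in UTF-16; the second is one visible grapheme built from two character codes.

```python
clef = "\U0001D11E"   # MUSICAL SYMBOL G CLEF, outside the 16-bit range
e_acute = "e\u0301"   # 'e' followed by COMBINING ACUTE ACCENT


def utf16_units(text: str) -> int:
    """Number of 16-bit code units in the UTF-16 encoding of text."""
    return len(text.encode("utf-16-le")) // 2


def utf8_bytes(text: str) -> int:
    """Number of octets in the UTF-8 encoding of text."""
    return len(text.encode("utf-8"))


# character codes, UTF-16 code units, UTF-8 octets
print(len(clef), utf16_units(clef), utf8_bytes(clef))           # 1 2 4
print(len(e_acute), utf16_units(e_acute), utf8_bytes(e_acute))  # 2 2 3
```

Neither count of code units tells you how many graphemes will be displayed, which is the point being made about display-oriented processing.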

> The reason I'm bringing this up right now is that the issue could fit
> nicely with the 5.1 "every type has a metatable" way of thinking;
> would it warrant an opportunity to have a closer look at what Lua
> means by 'strings' (or rather, their encoding) anyhow?

I'm pretty firmly of the belief that keeping strings as octet-sequences 
is really a simplification. It is not uncommon to have a mixture of 
character encodings in a single application, so assigning a metatable 
to the string type will often prove unsatisfactory. I'm not really sure 
what the solution is, but I have been bitten more than once by 
programming languages such as Perl and Python which have glued 
character encoding on to their basic string types. (In Python, for 
example, a UTF-8 sequence is *not* of type Unicode, which can be 
seriously awkward.)
If strings are simply octet-sequences, it becomes the programmer's 
responsibility to identify (or remember) the encoding for each string; 
that can also be awkward but it has the advantage of being clear.
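As a hedged illustration of why the encoding has to be tracked by the programmer, the following Python sketch interprets the same octet sequence under two different encodings. The sample bytes are an arbitrary choice for the example.

```python
raw = b"caf\xc3\xa9"   # an octet sequence with no encoding attached

as_utf8 = raw.decode("utf-8")          # 'caf' + U+00E9
as_latin1 = raw.decode("iso-8859-1")   # 'caf' + U+00C3 + U+00A9

print(len(raw))                        # 5 octets either way
print(len(as_utf8), len(as_latin1))    # 4 5
print(as_utf8 == as_latin1)            # False
```

The octets are identical; only the interpretation the programmer applies to them differs.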
For the record, there are some hidden subtleties, particularly in the
area of normalization. Unicode does not really specify a single
canonical normalization, but it is clear that the intent is that the
two non-compatibility forms (NFC and NFD) do define canonical equality
comparison. Unfortunately, this would have a significant impact on the
use of Unicode strings as table keys (which is, indeed, visible in both
Perl and Python): two canonically equal strings can be different octet
sequences, and so would be distinct keys unless both were normalized
first. UTF-8 at least has the virtue that any string which only
contains codes 0-127 (decimal) is identical between UTF-8 and
ISO-8859-x, and furthermore that all normalization forms are the
identity function for such strings.
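The table-key problem and the 0-127 exception can both be sketched with Python's standard `unicodedata` module (Python is used for illustration only; a dict stands in for a Lua table):

```python
import unicodedata

composed = unicodedata.normalize("NFC", "e\u0301")   # single code U+00E9
decomposed = unicodedata.normalize("NFD", "\u00e9")  # 'e' + U+0301

# Canonically equal strings are still distinct keys.
table = {composed: "value"}
print(decomposed in table)                                # False
print(unicodedata.normalize("NFC", decomposed) in table)  # True

# Strings restricted to codes 0-127 are unaffected by all of this.
ascii_key = "plain ASCII key"
forms = ("NFC", "NFD", "NFKC", "NFKD")
print(all(unicodedata.normalize(f, ascii_key) == ascii_key for f in forms))  # True
print(ascii_key.encode("utf-8") == ascii_key.encode("iso-8859-1"))           # True
```

Looking up a key reliably therefore means normalizing on every insert and every lookup, or restricting keys to the 0-127 range.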