Re: Lua 5.1 and UTF-8 ?

lua-l archive

[Date Prev][Date Next][Thread Prev][Thread Next] [Date Index] [Thread Index]

Subject: Re: Lua 5.1 and UTF-8 ?
From: Asko Kauppi <askok@...>
Date: Sun, 22 May 2005 21:53:06 +0300

What I'd propose is a (built-in) interface that defines the_operations_ available for the strings, and a mechanism to customizethem (regular Lua overrides should do). Also, such interface wouldtake a stand as to the binary/encoded string interpretations of a Luastring; this is probably the number one task for it.


We need:
    - string:width()
    - string:length()
    - string:bytes()

- regular expression handling (here, could we take thepresumption that binary data is never regex'ed? perphaps not)- string:byte() and/or the new [] mechanism (here, again,clashing with binary/utf-8 requirements)- literals: ability to write "\x0416" or similar and get theminto the right multi-byte (1-5) UTF-8 codes

All this can surely be done, and it will. The main question was onthe demand for such, and the vision/timetable of Lua authors. I'msure they'll be seeing such requests, and "built-in" ability tohandle UTF-8 could prove to be a vital "tipping point" for Luaevaluators.

I'm sure it will in a project I'm closely connected with.

-ak


Rici Lake kirjoitti 22.5.2005 kello 20.49:

On 22-May-05, at 11:55 AM, Asko Kauppi wrote:
I've been thinking about UTF-8 and Lua lately, and wonder how muchwork it would be to actually support that in Lua "out of thebox". There are some programming languages (s.a. Tck) that claimalready to do that, and I feel the concept would match Lua'stargets and philosophy rather nicely.
I guess that depends on what you mean by "support". Lua currentlydoes not interfere with UTF-8, but it lacks:
1) UTF8 validation
2) A mechanism for declaring encoding of a source file
3) An escape mechanism for including UTF-8 in string literals (sothat it is only possible by either using a UTF-8 aware text editor,or manually working out the decimal \ escapes for a particularcharacter)
4) Multicharacter aware string patterns with Unicode character classes
5) Various utilities, including single character-code conversion,code counting, normalization, etc.
Various people have made various attempts to implement some or allof these features; standard libraries exist for them (but they are"bulky").
I understand UTF-8 might not be everyone's favourite, but it ismine. :) And having a working framework (str:bytes(), str:length(), str:width()) could easily be adopted to other extendedencoding schemes as well.
There are arguments and counter-arguments for all of the standardUnicode Transfer Formats. UTF-8 is fairly easy to work with if themajority of the work is simply moving strings between components;it is less ideal for text processing, for which UTF-16 is generallybetter (there are arguments and counter-arguments about using a 32-bit internal representation; the 16-bit representation is stillvariable width because of surrogate pairs, but the fact thatgraphemes are often represented as multiple character codes meansthat display-oriented text processing is going to have to be ableto deal with variable length grapheme codes regardless of baseencoding.)
The reason I'm bringing this up right now, is that the issue couldsuit nicely with the 5.1 "every type has a metatable" way ofthinking; would it warrant an opportunity to have a closer look atwhat Lua means by 'strings' (or rather, their encoding) anyhow?
I'm pretty firmly of the belief that keeping strings as octet-sequences is really a simplification. It is not uncommon to have amixture of character encodings in a single application, soassigning a metatable to the string type will often proveunsatisfactory. I'm not really sure what the solution is, but Ihave been bitten more than once by programming languages such asPerl and Python which have glued character encoding on to theirbasic string types. (In Python, for example, a UTF-8 sequence is*not* of type Unicode, which can be seriously awkward.)
If strings are simply octet-sequences, it becomes the programmer'sresponsibility to identify (or remember) the encoding for eachstring; that can also be awkward but it has the advantage of beingclear.
For the record, there are some hidden subtleties, particularly inthe area of normalization. Unicode does not really specify acanonical normalization, but it is clear that the intent is thatthe two non-compatibility formats do define canonical equalitycomparison. Unfortunately, this would have a significant impact onthe use of Unicode strings as table keys (which is, indeed, visiblein both Perl and Python). UTF-8 at least has the virtue that anystring which only contains codes 0-127 (decimal) is identicalbetween UTF-8 and ISO-8859-x, and furthermore that allnormalization forms are the identity function for such strings.

References:
- Lua 5.1 and UTF-8 ?, Asko Kauppi
- Re: Lua 5.1 and UTF-8 ?, Rici Lake

Prev by Date: Re: Lua 5.1 and UTF-8 ?
Next by Date: Re: Lua 5.1 and UTF-8 ?
Previous by thread: Re: Lua 5.1 and UTF-8 ?
Next by thread: Re: Lua 5.1 and UTF-8 ?
Index(es):
- Date
- Thread