What I'd propose is a (built-in) interface that defines the _operations_ available for strings, and a mechanism to customize them (regular Lua overrides should do). Such an interface would also have to take a stand on the binary vs. encoded interpretation of a Lua string; that is probably its number one task.

We need:
    - string:width()
    - string:length()
    - string:bytes()
    - regular expression handling (here, could we take the presumption that binary data is never regex'ed? perhaps not)
    - string:byte() and/or the new [] mechanism (here, again, clashing with binary/UTF-8 requirements)
    - literals: the ability to write "\x0416" or similar and have it turn into the right multi-byte (1-4) UTF-8 sequence
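
To make the idea concrete, here is a minimal sketch of the string:length() part (my own illustration, not a proposal for the actual API; the name `ulen` and the implementation are purely hypothetical): a code-point counter attached to the shared string table, so plain Lua overrides are enough.

    -- Hypothetical sketch: count UTF-8 code points by skipping
    -- continuation bytes (0x80-0xBF).  Plain Lua, no extensions needed.
    local function utf8_length(s)
      local count = 0
      for i = 1, string.len(s) do
        local b = string.byte(s, i)
        if b < 0x80 or b >= 0xC0 then  -- plain ASCII or a lead byte
          count = count + 1
        end
      end
      return count
    end

    -- Regular Lua override: strings index into the string table,
    -- so every string picks the method up automatically.
    string.ulen = utf8_length

    print(("h\195\169llo"):ulen())  --> 5 code points, though 6 octets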

All this can surely be done, and it will be. The main question is the demand for it, and the vision and timetable of the Lua authors. I'm sure they'll be seeing such requests, and a "built-in" ability to handle UTF-8 could prove to be a vital "tipping point" for people evaluating Lua.
I'm sure it will be in a project I'm closely connected with.

-ak


Rici Lake wrote on 22.5.2005 at 20.49:


On 22-May-05, at 11:55 AM, Asko Kauppi wrote:



I've been thinking about UTF-8 and Lua lately, and wonder how much work it would be to actually support it in Lua "out of the box". Some programming languages (such as Tcl) already claim to do that, and I feel the concept would match Lua's goals and philosophy rather nicely.


I guess that depends on what you mean by "support". Lua currently does not interfere with UTF-8, but it lacks:

1) UTF-8 validation
2) A mechanism for declaring the encoding of a source file
3) An escape mechanism for including UTF-8 in string literals (at present this is only possible by using a UTF-8-aware text editor, or by manually working out the decimal \ escapes for a particular character)
4) Multibyte-character-aware string patterns with Unicode character classes
5) Various utilities, including single character-code conversion, code counting, normalization, etc.
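
As a rough idea of what (1) would involve, here is a simplified byte-level check in plain Lua (my own sketch, following the RFC 3629 lead-byte ranges; note that it does not reject every overlong or surrogate sequence):

    local function is_valid_utf8(s)
      local i, n = 1, string.len(s)
      while i <= n do
        local b = string.byte(s, i)
        local len
        if b < 0x80 then len = 1                      -- ASCII
        elseif b >= 0xC2 and b <= 0xDF then len = 2
        elseif b >= 0xE0 and b <= 0xEF then len = 3
        elseif b >= 0xF0 and b <= 0xF4 then len = 4
        else return false end                         -- invalid lead byte
        for j = i + 1, i + len - 1 do
          local c = string.byte(s, j)
          if not c or c < 0x80 or c > 0xBF then
            return false                              -- bad continuation byte
          end
        end
        i = i + len
      end
      return true
    end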

Various people have attempted to implement some or all of these features; standard libraries exist for them (but they are "bulky").



I understand UTF-8 might not be everyone's favourite, but it is mine. :) And having a working framework (str:bytes(), str:length(), str:width()) could easily be adapted to other extended encoding schemes as well.


There are arguments and counter-arguments for all of the standard Unicode Transformation Formats. UTF-8 is fairly easy to work with if the majority of the work is simply moving strings between components; it is less ideal for text processing, for which UTF-16 is generally better. (There are arguments and counter-arguments about using a 32-bit internal representation; the 16-bit representation is still variable-width because of surrogate pairs, but the fact that graphemes are often represented as multiple character codes means that display-oriented text processing is going to have to deal with variable-length grapheme codes regardless of the base encoding.)
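
A small worked example of that variable-width point (my own illustration): U+10400, a character outside the BMP, is four octets in UTF-8 and a surrogate pair in UTF-16.

    local s = string.char(0xF0, 0x90, 0x90, 0x80)  -- UTF-8 for U+10400
    print(string.len(s))   --> 4 octets, which is what Lua counts
    -- UTF-16 stores the same character as the two code units D801 DC00,
    -- so even a 16-bit representation is variable width.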


The reason I'm bringing this up right now is that the issue could fit nicely with the 5.1 "every type has a metatable" way of thinking; might it be an opportunity to take a closer look at what Lua means by 'strings' (or rather, their encoding) anyway?


I'm pretty firmly of the belief that keeping strings as octet sequences is a genuine simplification. It is not uncommon to have a mixture of character encodings in a single application, so assigning a metatable to the string type will often prove unsatisfactory. I'm not really sure what the solution is, but I have been bitten more than once by programming languages such as Perl and Python which have glued character encoding onto their basic string types. (In Python, for example, a UTF-8 sequence is *not* of type Unicode, which can be seriously awkward.)

If strings are simply octet sequences, it becomes the programmer's responsibility to identify (or remember) the encoding of each string; that can also be awkward, but it has the advantage of being clear.
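
One sketch of what "the programmer keeps track" could look like in practice (the names here are purely illustrative, not a recommendation): carry the octets together with their declared encoding, instead of giving the string type itself a global meaning.

    local function tagged(bytes, encoding)
      return { bytes = bytes, encoding = encoding }
    end

    local latin = tagged("\228", "ISO-8859-1")   -- 0xE4, a-umlaut in Latin-1
    local utf   = tagged("\195\164", "UTF-8")    -- the same character in UTF-8

    -- Any comparison or conversion now has to consult the tag explicitly.
    assert(latin.encoding ~= utf.encoding)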

For the record, there are some hidden subtleties, particularly in the area of normalization. Unicode does not really specify a canonical normalization, but it is clear that the intent is that the two non-compatibility forms (NFC and NFD) define canonical equality comparison. Unfortunately, this would have a significant impact on the use of Unicode strings as table keys (which is, indeed, visible in both Perl and Python). UTF-8 at least has the virtue that any string containing only codes 0-127 (decimal) is identical in UTF-8 and ISO-8859-x, and furthermore that all normalization forms are the identity function for such strings.
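
To make the table-key point concrete (my own illustration): "e with acute" in its precomposed form (U+00E9) and its decomposed form (U+0065 U+0301) are canonically equivalent, but they are distinct octet sequences and therefore distinct Lua table keys.

    local nfc = "\195\169"   -- UTF-8 for U+00E9 (precomposed)
    local nfd = "e\204\129"  -- "e" followed by UTF-8 for U+0301 (combining acute)
    local t = {}
    t[nfc] = "first"
    t[nfd] = "second"
    print(nfc == nfd)        --> false: Lua compares raw octets
    -- t now holds two entries for what a user would see as the same text.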