On 02/11/2012 17:11, M. Edward (Ed) Borasky wrote:
Unicode in general and UTF-8 in particular are quickly becoming indispensable and Lua programmers need a standardized way of dealing with them, either in libraries or in extensions to the language syntax and semantics. Personally I favor libraries since they can be blazingly fast and don't break existing code. But they do need to be there and work.
For a while I planned to work only with genuinely Unicode-aware libraries for text processing (I even had a prototype of one such library, written in and for Lua). However, I had to go back to plain byte strings, for the following reason: Unicode "abstract characters", that is, what a Unicode code represents, are not characters. They are whatever the standards team decided to encode. As one expects, there are simple base characters such as 'a', control codes, a bunch of esoteric special codes, and tons of *combining* codes, which form *actual characters* when composed with base codes. This means that a character is represented by a sequence of n codes (n has no formal limit), each encoded as 1-4 bytes in UTF-8. To add a bit of complication, Unicode (or rather UCS) also defines precomposed codes for precomposed characters. So the letter 'â' may be UCS-coded (in code points, not bytes) as either 1 single code or 2 codes: one for the base 'a', one for the combining '^'.

I guess you start to imagine the mess involved in getting things right and safe. For instance, how does one search for a word containing 'â'? We first need to normalise to the decomposed form (which is faster and also has the advantage of exposing sub-character units such as '^'); but this requires grouping codes into characters and sorting them (yes, the order of the combining codes is not defined, except for the base, and there are exceptions). All of this comes after decoding from UTF-8 to a string of Unicode codes. This is doable, but it is a lot of complication, I guess. Maybe I used the wrong approach, but after tons of exchanges on the topic with experts, no one could find a better solution.

There is, I guess, no hope of getting back the ideal simplicity of 1 char <--> 1 repr (and even less of representations of equal length) that we lived with in ASCII and ISO Latin times. There is no affordable way to get strings as sequences of chars, with s[i] = the i-th char, exactly and completely.
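To make the 'â' case concrete, here is a minimal sketch. It is in Python rather than Lua only because Python's standard library ships a normalizer (`unicodedata`), while stock Lua has none. The helper name `contains` and the sample word are made up for illustration, and this is not the prototype library mentioned above.

```python
import unicodedata

precomposed = "\u00e2"   # 'â' as one single code (U+00E2)
decomposed = "a\u0302"   # base 'a' + U+0302 COMBINING CIRCUMFLEX ACCENT

# Same character for the reader, different code sequences and UTF-8 lengths.
print(len(precomposed), len(precomposed.encode("utf-8")))  # 1 code, 2 bytes
print(len(decomposed), len(decomposed.encode("utf-8")))    # 2 codes, 3 bytes
print(precomposed == decomposed)                           # False


def contains(text, word):
    """Search after normalising both sides to the decomposed form (NFD)."""
    def nfd(s):
        return unicodedata.normalize("NFD", s)
    return nfd(word) in nfd(text)


text = "ch" + decomposed + "teau"    # "château" stored in decomposed form
word = "ch" + precomposed + "teau"   # the same word, typed precomposed

print(word in text)          # False: a naive search misses it
print(contains(text, word))  # True: found once both are normalised

# Normalisation also sorts combining codes into a canonical order:
# circumflex then dot below becomes dot below then circumflex.
print(unicodedata.normalize("NFD", "a\u0302\u0323") == "a\u0323\u0302")  # True
```

Even after normalisation, `s[i]` here still indexes codes, not characters: `decomposed[0]` is a bare 'a'. Grouping codes into whole characters is a further step that this sketch does not attempt.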
Denis

PS: I still do not know why composite codes were introduced in addition to the plain decomposed forms, which are the base UCS coding (the composite codes are the core source of the issue for me; without them, characters would have a single representation), nor why a misleading term like "abstract character" is used.