Re: Clearing up misconceptions about characters vs bytes in the manual

lua-l archive


Subject: Re: Clearing up misconceptions about characters vs bytes in the manual
From: spir <denis.spir@...>
Date: Fri, 02 Nov 2012 19:55:41 +0100

On 02/11/2012 17:11, M. Edward (Ed) Borasky wrote:

Unicode in general and UTF-8 in particular are quickly becoming
indispensable and Lua programmers need a standardized way of dealing
with them, either in libraries or in extensions to the language syntax
and semantics. Personally I favor libraries since they can be
blazingly fast and don't break existing code. But they do need to be
there and work.

For a while I planned to work only with genuinely Unicode-aware libraries for text processing (I even had a prototype of such a library, in and for Lua). However, I had to go back to plain byte strings, for the following reason: Unicode "abstract characters", that is, what a Unicode code represents, are not characters. They are whatever the standards team decided to encode. There are, as one expects, simple base characters such as 'a', control codes, a bunch of esoteric special codes, and tons of *combining* codes which form *actual characters* when composed with base codes.

This means that a character is represented by a sequence of n codes (n has no formal limit), each encoded as 1-4 bytes in UTF-8. To add a bit of complication, Unicode (or rather UCS) also defines precomposed codes for precomposed characters. This means the letter 'â' may be UCS-coded (in code points, not bytes) as 1 single code or as 2 codes: one for the base 'a', one for the combining '^'. I guess you start to imagine the mess involved in getting things right and safe.

For instance, how does one search for a word containing 'â'? We first need to normalise to decomposed form (which is faster and also has the advantage of exposing sub-character units such as '^'); but this requires grouping codes into characters and sorting them (yes, the order of combining codes is not defined, except for the base, and there are exceptions). All of this comes after decoding from UTF-8 to a string of Unicode codes.

This is doable, but a lot of complication, I guess. Maybe I used the wrong approach, but after tons of exchanges on the topic with experts, no one could find a better solution.

There is, I guess, no hope of getting back the ideal simplicity of 1 char <--> 1 repr (and even less of representations of equal length) that we lived with in ASCII and ISO Latin times. There is no affordable way to get strings as sequences of chars, with s[i] = the i-th char, exact and complete.
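To make the steps above concrete, here is a minimal sketch. It is written in Python rather than Lua only because Python's standard `unicodedata` module provides normalization out of the box; it is an illustration, not the prototype library mentioned above. The helper names `nfd`, `contains` and `clusters` are made up for this example, and `clusters` is a deliberate simplification (base code plus following combining marks), not the full grapheme segmentation defined by Unicode.

```python
import unicodedata

# The same visible character 'â', coded in two different ways.
precomposed = "\u00e2"   # one code: LATIN SMALL LETTER A WITH CIRCUMFLEX
decomposed = "a\u0302"   # two codes: 'a' + COMBINING CIRCUMFLEX ACCENT

# Different code sequences, and different lengths once encoded as UTF-8.
assert precomposed != decomposed
assert len(precomposed.encode("utf-8")) == 2
assert len(decomposed.encode("utf-8")) == 3


def nfd(text):
    """Normalise to canonical decomposed form (NFD).

    This also puts combining marks into a canonical order.
    """
    return unicodedata.normalize("NFD", text)


def contains(haystack, needle):
    """Substring search that ignores precomposed/decomposed differences.

    Caveat: this compares codes, not characters, so searching for a
    plain 'a' also matches the base of a decomposed 'â'.
    """
    return nfd(needle) in nfd(haystack)


def clusters(text):
    """Group codes into 'characters': a base plus its combining marks.

    Simplified on purpose; a combining mark with no base before it
    starts its own group.
    """
    groups = []
    for code in nfd(text):
        if groups and unicodedata.combining(code) != 0:
            groups[-1] += code   # attach the mark to the previous base
        else:
            groups.append(code)  # start a new character
    return groups


# After grouping, indexing gives whole characters rather than single codes.
word = clusters("ch\u00e2teau")
assert len(word) == 7
assert word[2] == "a\u0302"
```

With these helpers, `contains("cha\u0302teau", "\u00e2")` is true although the two strings share no code for 'â', and `nfd("a\u0302\u0323") == nfd("a\u0323\u0302")` is true because normalization reorders the two combining marks. The cost is visible too: every operation starts with a normalization pass over the whole string, and `clusters` has to build a list before s[i] means "the i-th character".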


Denis

PS: The reasons why composite codes were introduced (they are the core source of the issue, for me; otherwise characters would have a single representation), in addition to the plain decomposed forms which are the base UCS coding, and why a misleading term like "abstract character" is used, remain unknown to me.

Follow-Ups:
- Re: Clearing up misconceptions about characters vs bytes in the manual, Rena
- Re: Clearing up misconceptions about characters vs bytes in the manual, William Ahern

References:
- Clearing up misconceptions about characters vs bytes in the manual, Rob Hoelz
- Re: Clearing up misconceptions about characters vs bytes in the manual, Rapin Patrick
- Re: Clearing up misconceptions about characters vs bytes in the manual, M. Edward (Ed) Borasky
