On 02/11/2012 22:21, Coda Highland wrote:
> On Fri, Nov 2, 2012 at 2:11 PM, Rena <hyperhacker@gmail.com> wrote:
> > I think the reason combining characters exist is that in some languages the number of valid combinations is quite huge. Korean writing for one example has each character made by combining multiple base characters.
>
> Except that Korean has all of the combined characters in Unicode too. There aren't THAT many of them, and not all combinations are legal. The reason combining characters exist is that it dramatically simplifies collation. It makes it clear that "A with diaeresis" is still an A and sorts as one and can be searched for in a body of text as one.
Yes, precisely. And then there would be a single coded form for each character (as users or programmers think of them). For instance, there would no longer be 4 ways of encoding "Lüâ". How do you search for it in a text if precombined characters also exist?
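To make the "4 ways" concrete, here is a minimal sketch in Python (chosen only because its standard `unicodedata` module is handy; nothing here is Lua-specific). The helper names `nfc` and `contains` are made up for illustration, and normalizing both sides before a substring test is just one simple approach, not a complete Unicode-aware search.

```python
import unicodedata

# "u with diaeresis" and "a with circumflex", each precomposed and decomposed.
U_FORMS = ["\u00fc", "u\u0308"]
A_FORMS = ["\u00e2", "a\u0302"]

# All four code point sequences that render as the same three characters.
variants = ["L" + u + a for u in U_FORMS for a in A_FORMS]

def nfc(s):
    """Normalize to NFC (hypothetical helper name)."""
    return unicodedata.normalize("NFC", s)

def contains(haystack, needle):
    """Substring test after bringing both sides to the same normal form."""
    return nfc(needle) in nfc(haystack)

# Text that spells the word with a decomposed u-diaeresis.
text = "I like L" + "u\u0308" + "\u00e2" + " a lot"
needle = "L\u00fc\u00e2"  # fully precomposed spelling

naive_hit = needle in text                 # plain code point comparison
normalized_hit = contains(text, needle)    # comparison after normalization
```

The plain comparison misses the word because the code points differ, while the normalized comparison finds it. That is the extra step precombined characters force on every search.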
> > Anyway, the clusterpork that is Unicode combining glyphs is not really a Lua bug... Probably we should be pestering the authors of the spec.
>
> You're over two decades too late for that.
Actually, maybe the initial plan was for UCS (the Universal Character Set) to hold only plain "abstract characters", without any precombinations. This obviously reduces the number of codes dramatically (to a few tens of thousands). That is probably why they thought 16 bits were more than enough... and widely used libraries are built on this (erroneous) assumption (e.g. ICU or Java strings), which only changed late, after the standard was already in use.
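As a small illustration of why 16 bits did not turn out to be enough, the sketch below (again in Python, purely as an example) takes one character from outside the 16-bit range and shows the pair of 16-bit code units UTF-16 needs for it. The choice of U+1D11E (MUSICAL SYMBOL G CLEF) is arbitrary.

```python
# One character whose code point does not fit in 16 bits.
ch = "\U0001D11E"  # MUSICAL SYMBOL G CLEF

# Encode as big-endian UTF-16 (no byte order mark) and split into 16-bit units.
utf16 = ch.encode("utf-16-be")
units = [int.from_bytes(utf16[i:i + 2], "big") for i in range(0, len(utf16), 2)]

fits_in_16_bits = ord(ch) <= 0xFFFF
unit_count = len(units)  # number of 16-bit code units needed
```

A library that treats each 16-bit unit as one character sees two "characters" here, which is the kind of breakage the 16-bit assumption leads to.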
> /s/ Adam
Denis