Miles Bader wrote: [...]
It seems there needs to be a clear distinction between "raw char" (given that lpeg is quite usable for binary data) and "unicode char".
The problem is that Unicode doesn't really have any such concept as a 'character', which means that traditional string handling methods basically don't work with it (even if you ignore UTF-8 encoding). A single displayable thing can actually be made up of several Unicode code points, and may even have several different (but technically equivalent) representations.
I'm afraid it's just a fundamentally hard problem, and I haven't seen any decent abstractions over it yet.
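To make the point about equivalent representations concrete, here is a minimal sketch in Python (not Lua or LPeg; chosen only because its standard library ships Unicode normalization). The sample strings and the helper names `describe` and `canonically_equal` are illustrative assumptions, not part of any library discussed in this thread.

```python
import unicodedata

# Two spellings of the same displayed "é" (chosen purely for illustration).
precomposed = "\u00e9"   # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # U+0065 followed by U+0301 COMBINING ACUTE ACCENT


def describe(s):
    """Return (code point count, UTF-8 byte count) for a string."""
    return len(s), len(s.encode("utf-8"))


def canonically_equal(a, b):
    """Compare two strings after NFC normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(describe(precomposed))   # (1, 2)
print(describe(decomposed))    # (2, 3)
print(precomposed == decomposed)                    # False
print(canonically_equal(precomposed, decomposed))   # True
```

So one displayed thing can be 1 or 2 code points and 2 or 3 bytes, and a naive comparison treats the two spellings as different. Counting UTF-8 sequences instead of bytes would handle the encoding layer but not this one.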
Making P(x) count utf8 chars would certainly be convenient for people reading utf8 files, but... it doesn't seem the cleanest thing in general....
*Nothing* about Unicode is clean...

-- 
David Given
dg@cowlark.com