- Subject: Re: LPeg support for utf-8
- From: David Given <dg@...>
- Date: Fri, 01 Apr 2011 18:49:53 +0000
On 01/04/11 17:37, Marc Balmer wrote:
[...]
> On the C level, quite a lot. strlen() and friends can no longer be
> used, printf format strings like "%20s" don't work anymore etc. Not to
> speak about string comparison, collation etc. Since I am not familiar
> with LPeg's implementation, that is about all I can say.
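The same byte-versus-character mismatch exists at the Lua level too, for
what it's worth (a quick illustration of Marc's point, not anything to do
with LPeg itself; the string below is UTF-8 encoded):

  -- Both # and string.format count bytes, not characters or columns,
  -- so multi-byte UTF-8 text throws lengths and padding off.
  local s = "héllo"                     -- 5 characters, 6 bytes in UTF-8
  print(#s)                             --> 6
  print(string.format("[%-10s]", s))    --> padded to 10 bytes, so only 9 columns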
Determining the length of a Unicode string is a pretty fuzzy concept
anyway --- AFAIK the only way to do it is to break it up into grapheme
clusters and determine the display width of each grapheme cluster
individually (which may vary according to font).
I tend to use a cheap and nasty mechanism for console applications that
assumes each code point is a grapheme cluster, and then uses a set of
rules to decide whether it's of width 1 or 2. This works most of the
time but not all of the time. See:
http://wordgrinder.hg.sourceforge.net/hgweb/wordgrinder/wordgrinder/file/f658d1e8f1f3/src/c/emu/wcwidth.c
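Roughly, that approach looks like this in Lua (my own much-simplified
sketch, not the wcwidth.c code above; it assumes valid UTF-8 input and
only special-cases a few well-known wide ranges):

  -- Decode the code point starting at byte position i of a UTF-8 string.
  -- Returns the code point and the position of the next character.
  local function decode(s, i)
    local b = s:byte(i)
    if b < 0x80 then return b, i + 1 end
    local n = (b >= 0xF0 and 4) or (b >= 0xE0 and 3) or 2
    local cp = b % (2 ^ (8 - n))        -- strip the length-prefix bits
    for j = i + 1, i + n - 1 do
      cp = cp * 64 + s:byte(j) % 64     -- append each continuation byte's 6 bits
    end
    return cp, i + n
  end

  -- Crude width rule: a few well-known wide ranges get 2 columns,
  -- everything else gets 1. (Real wcwidth tables are far bigger.)
  local function width(cp)
    if (cp >= 0x1100 and cp <= 0x115F)       -- Hangul Jamo
    or (cp >= 0x2E80 and cp <= 0x9FFF)       -- CJK, kana, etc.
    or (cp >= 0xAC00 and cp <= 0xD7A3)       -- Hangul syllables
    or (cp >= 0xFF00 and cp <= 0xFF60) then  -- fullwidth forms
      return 2
    end
    return 1
  end

  -- Column width of a whole string, assuming one code point per
  -- grapheme cluster (the cheap and nasty assumption above).
  local function string_width(s)
    local w, i = 0, 1
    while i <= #s do
      local cp
      cp, i = decode(s, i)
      w = w + width(cp)
    end
    return w
  end

With that, string_width("héllo") comes out as 5 columns even though the
string is 6 bytes long, and a run of CJK text comes out at twice its
code point count.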
What I'd like from LPeg is a set of primitives for matching a single
code point and a single grapheme cluster (treating them as Lua strings,
i.e. sequences of bytes). This would make it much easier to parse UTF-8
strings. The collation stuff might be useful, but it's hideously
complicated, involves massive tables, and I've never actually found a
need for it, so I'd be willing to live without it.
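For the code point half, an ordinary LPeg pattern already gets most of
the way there (a sketch, assuming well-formed UTF-8 input; grapheme
clusters are the harder part, since they need the Unicode segmentation
rules and their tables on top of this):

  local lpeg = require "lpeg"
  local P, R, C, Ct = lpeg.P, lpeg.R, lpeg.C, lpeg.Ct

  -- One UTF-8 code point, matched as a short sequence of bytes.
  local cont = R("\128\191")                        -- continuation byte
  local cp   = R("\0\127")                          -- 1-byte (ASCII)
             + R("\194\223") * cont                 -- 2-byte sequence
             + R("\224\239") * cont * cont          -- 3-byte sequence
             + R("\240\244") * cont * cont * cont   -- 4-byte sequence

  -- Capture every code point as a byte string, anchored to the whole input.
  local codepoints = Ct(C(cp)^0) * P(-1)

  local t = lpeg.match(codepoints, "héllo")
  print(#t)      --> 5 code points, even though the string is 6 bytes
  print(t[2])    --> the two-byte sequence for "é"

The grapheme cluster version is the part that really needs library
support, since it can't be written anywhere near this compactly.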
--
┌─── dg@cowlark.com ───── http://www.cowlark.com ─────
│ "I have always wished for my computer to be as easy to use as my
│ telephone; my wish has come true because I can no longer figure out
│ how to use my telephone." --- Bjarne Stroustrup