- Subject: Re: Matching multibyte alphabetical characters with LPeG
- From: Jay Carlson <nop@...>
- Date: Mon, 18 Jun 2012 22:59:24 -0400
On Jun 17, 2012, at 7:13 PM, Miles Bader wrote:
> Jay Carlson <nop@nop.com> writes:
>> sunk-cost:slnunicode-1.1a nop$ size slnudata.o
>> __TEXT  __DATA  __OBJC  others  dec    hex
>> 0       14012   0       0       14012  36bc
>>
>> No, it does not provide enough to write a bidi renderer, but it does
>> characterize each code point as one of 30 classes--and includes
>> toupper/tolower/totitlecase.
>>
>> http://files.luaforge.net/releases/sln/slnunicode
>
> Hmm, slnunicode seems to use clever techniques to compress the table
> (note, though, that it only supports the BMP [16-bit characters],
> which is kind of a lose ... this is 2012, people!).
Yeah, I saw the BMP limitation and sighed. The compression is clever, but it only works because so little information is stored per code point.
> Still, it's functionality that's best left to a separate library, not
> something that should be in LPEG.
I don't think it is unreasonable to ask for a capture of a run of alphabetic grapheme clusters; this is what isalpha() does for "å" in single-byte locales. You will note the name of the author of the original post in this thread can't be written in ASCII but can be in Latin-1. UTF-8 is making Western Europe and the (non-US) Americas suffer the same problems the rest of the world deals with all the time. Leaning on precomposed characters seems like the same kind of restriction as the BMP.
I do think there is something distasteful about keeping the table itself in the core system. It would be nice to separate the mechanism from the data.
>> There's still the grapheme problem for å vs å; hopefully you can't
>> tell the second is "a".."␣̊". [1]
>>
>> How should lpeg match the one with a separate combining mark version
>> against character classes?
>
> Note that it's generally only the first character in such a sequence
> whose attributes really matter; combining marks are just sort of
> tacked on. If someone cares about such things (e.g. they care about
> splitting off a single "character", including combining marks), they
> can easily handle combining marks at a higher level with LPEG.
Yeah. That would make a good example: how to capture [[:alpha:]]+ in the brave new world, even if restricted to, say, the all-Europe Latin. I would think you'd want a capture of å (precomposed, U+00E5) and å ("a" followed by U+030A COMBINING RING ABOVE) to return the same thing, not stopping at "a" in the second case.
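To make the desired behaviour concrete, here is a minimal sketch in Python rather than LPeg, only because Python's standard unicodedata module already ships the category tables this thread is arguing about. The function names (is_alpha, is_mark, alpha_runs) are made up for illustration. It assumes "alphabetic" means general category L*, treats any M* character following a letter as part of that letter's cluster, and normalizes each run to NFC so both spellings compare equal. This is a simplification of real grapheme cluster segmentation (UAX #29), not a replacement for it.

```python
import unicodedata


def is_alpha(ch):
    # Assumption: general categories Lu, Ll, Lt, Lm, Lo count as alphabetic.
    return unicodedata.category(ch).startswith("L")


def is_mark(ch):
    # Mn, Mc and Me are the combining mark categories.
    return unicodedata.category(ch).startswith("M")


def alpha_runs(text):
    """Return runs of letters, each letter keeping its trailing combining marks.

    Every run is NFC-normalized, so precomposed and decomposed input
    yield the same captured string.
    """
    runs = []
    current = []
    for ch in text:
        # A mark only extends a run that already has a base letter.
        if is_alpha(ch) or (current and is_mark(ch)):
            current.append(ch)
        elif current:
            runs.append(unicodedata.normalize("NFC", "".join(current)))
            current = []
    if current:
        runs.append(unicodedata.normalize("NFC", "".join(current)))
    return runs


precomposed = "\u00e5ngstr\u00f6m 42"      # U+00E5 and U+00F6 as single code points
decomposed = "a\u030angstro\u0308m 42"     # base letters followed by U+030A / U+0308

print(alpha_runs(precomposed) == alpha_runs(decomposed))
```

The same shape should carry over to LPeg as a pattern of the form (letter * mark^0)^1, with the letter and mark classes supplied from outside the library, which is the mechanism/data split mentioned above.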
Jay