Re: Matching multibyte alphabetical characters with LPeG

lua-l archive

Subject: Re: Matching multibyte alphabetical characters with LPeG
From: Jay Carlson <nop@...>
Date: Mon, 18 Jun 2012 22:59:24 -0400

On Jun 17, 2012, at 7:13 PM, Miles Bader wrote:

> Jay Carlson <nop@nop.com> writes:

>> sunk-cost:slnunicode-1.1a nop$ size slnudata.o
>> __TEXT	__DATA	__OBJC	others	dec	hex
>> 0	14012	0	0	14012	36bc
>> 
>> No, it does not provide enough to write a bidi renderer, but it does
>> characterize each code point as one of 30 classes--and includes
>> toupper/tolower/totitlecase.
>> 
>> http://files.luaforge.net/releases/sln/slnunicode
> 
> Hmm, slnunicode seems to use clever techniques to compress the table
> (note, though, that it only supports the BMP [16-bit characters],
> which is kind of a lose ... this is 2012, people!).

Yeah, I saw the BMP limitation and sighed. The compression is clever, but it does depend on there not being very much information stored.

> Still, it's functionality that's best left to a separate library, not
> something that should be in LPEG.

I don't think it is unreasonable to ask for a capture of a run of alphabetic grapheme clusters; this is what isalpha() does for "å" in single-byte locales. You will note the name of the author of the original post in this thread can't be written in ASCII but can be in Latin-1. UTF-8 is making Western Europe and the (non-US) Americas suffer the same problems the rest of the world deals with all the time. Leaning on precomposed characters seems like the same kind of restriction as the BMP.
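
To make the precomposed-versus-combining distinction concrete, here is a small sketch. It is written in Python rather than Lua only because Python's standard library ships the Unicode character database (`unicodedata`); nothing in it is LPeg or slnunicode API, and the variable names are made up for illustration.

```python
import unicodedata

precomposed = "\u00e5"   # å as one code point (one byte, 0xE5, in Latin-1)
decomposed = "a\u030a"   # "a" followed by U+030A COMBINING RING ABOVE

# Classifying code point by code point accepts the base letter but
# rejects the combining mark, so an "alphabetic run" ends after "a".
per_code_point = [(unicodedata.category(c), c.isalpha()) for c in decomposed]

# The two spellings are canonically equivalent: NFC composes them
# into the same single code point.
same_after_nfc = unicodedata.normalize("NFC", decomposed) == precomposed
```

The combining mark has general category Mn, not a letter category, which is why a per-code-point isalpha()-style test gives up after the "a" in the decomposed spelling.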

I do think there is something distasteful about keeping the table itself in the core system. It would be nice to separate the mechanism from the data.

>> There's still the grapheme problem for å vs å; hopefully you can't
>> tell the second is "a".."␣̊". [1]
>> 
>> How should lpeg match the one with a separate combining mark version
>> against character classes?
> 
> Note that it's generally only the first character in such a sequence
> whose attributes really matter; combining marks are just sort of
> tacked on.  If someone cares about such things (e.g. they care about
> splitting off a single "character", including combining marks), they
> can easily handle combining marks at a higher level with LPEG.

Yeah. That would make a good example: how to capture [[:alpha:]]+ in the brave new world, even if restricted to, say, the Latin letters used across Europe. I would think you'd want a capture of å (precomposed, U+00E5) and å ("a" plus combining ring above, U+030A) to return the same thing, not stopping at "a" in the second case.
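
A rough sketch of what such a capture could look like, again in Python because its standard library carries the Unicode tables. The function names (`is_letter`, `is_mark`, `capture_alpha_run`) are hypothetical, and "one letter followed by any number of combining marks" is a simplification of real grapheme cluster rules, not a full implementation of them. The classifiers are passed in as parameters so that the matching mechanism stays separate from the table data.

```python
import unicodedata

def is_letter(ch):
    # General categories Lu, Ll, Lt, Lm, Lo.
    return unicodedata.category(ch).startswith("L")

def is_mark(ch):
    # General categories Mn, Mc, Me (combining marks).
    return unicodedata.category(ch).startswith("M")

def capture_alpha_run(text, start=0, letter=is_letter, mark=is_mark):
    """Capture the longest run, beginning at `start`, of clusters of the
    form: one letter followed by zero or more combining marks.

    Returns (capture, next_index). The capture is NFC-normalized so the
    precomposed and decomposed spellings compare equal; next_index is a
    position in the original, unnormalized text.
    """
    i, n = start, len(text)
    while i < n and letter(text[i]):
        i += 1
        while i < n and mark(text[i]):   # marks ride along with their base
            i += 1
    return unicodedata.normalize("NFC", text[start:i]), i

composed, _ = capture_alpha_run("\u00e5ngstr\u00f6m 1")
decomposed, _ = capture_alpha_run("a\u030angstro\u0308m 1")
assert composed == decomposed
```

In LPeg terms the same shape would be roughly C((letter * mark^0)^1), given `letter` and `mark` patterns built from whatever table is available; those two patterns are assumptions here, not something LPeg provides.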

Jay

References:
- Matching multibyte alphabetical characters with LPeG, Hinrik Örn Sigurðsson
- Re: Matching multibyte alphabetical characters with LPeG, Miles Bader
- Re: Matching multibyte alphabetical characters with LPeG, Jay Carlson
- Re: Matching multibyte alphabetical characters with LPeG, Miles Bader
