- Subject: Re: LPeg support for utf-8
- From: Tony Finch <dot@...>
- Date: Mon, 4 Apr 2011 11:00:55 +0100
Roberto Ierusalimschy <roberto@inf.puc-rio.br> wrote:
>
> This is more or less what I had in mind (specific names
> notwithstanding). But still remains the question of whether each of these
> constructions (uS, uR, etc.) is really useful and whether there should
> be others.
I think the main thing that is difficult for an lpeg user to do, and which
lpeg itself should do reasonably well, is union and intersection on sets of
Unicode code points encoded in UTF-8. Hence uS and uR being my first
suggestions.
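To make the set operations concrete, here is a rough sketch in Python (not
LPeg) that treats a set of code points as a list of inclusive (lo, hi)
ranges. The function names normalize, union and intersection are
hypothetical and only illustrate what a uS/uR-style constructor would need
to compute before any UTF-8 encoding comes into play.

```python
def normalize(ranges):
    """Sort inclusive (lo, hi) ranges and merge overlapping or adjacent ones."""
    merged = []
    for lo, hi in sorted(ranges):
        if merged and lo <= merged[-1][1] + 1:
            # Overlaps or touches the previous range: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], hi))
        else:
            merged.append((lo, hi))
    return merged


def union(a, b):
    """Code points that are in a or in b."""
    return normalize(list(a) + list(b))


def intersection(a, b):
    """Code points that are in both a and b."""
    a, b = normalize(a), normalize(b)
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if lo <= hi:
            out.append((lo, hi))
        # Advance whichever range ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


greek = [(0x0370, 0x03FF)]
other = [(0x0300, 0x037F), (0x03F0, 0x0500)]
print(intersection(greek, other))  # [(880, 895), (1008, 1023)]
```

Working on code point ranges first and only then turning the result into
byte patterns keeps the set logic independent of the encoding.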
At the moment lpeg optimizes certain pattern combinations into character
set operations. I wonder if the pattern optimizer is powerful enough to do
this efficiently for patterns that match UTF-8 encoded unicode character
sets, or if it needs some extra logic to optimize matches across sets of
the next 1-4 octets.
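One way to picture the "next 1-4 octets" problem is to split a code point
range into a few alternatives, each of which is a fixed-length sequence of
byte ranges that a byte-oriented matcher can test directly. The Python
sketch below does this; it is an illustration under my own assumptions
(the names utf8_sequences and matches are hypothetical), not a description
of what lpeg's optimizer does.

```python
MAX_BY_LENGTH = (0x7F, 0x7FF, 0xFFFF)  # largest code point in 1, 2, 3 bytes


def utf8_sequences(lo, hi):
    """Split code points lo..hi into lists of (byte_lo, byte_hi) ranges."""
    if lo > hi:
        return []
    # Surrogates have no UTF-8 encoding, so cut them out of the range.
    if lo <= 0xDFFF and hi >= 0xD800:
        return utf8_sequences(lo, 0xD7FF) + utf8_sequences(0xE000, hi)
    # Split where the encoded length changes.
    for limit in MAX_BY_LENGTH:
        if lo <= limit < hi:
            return utf8_sequences(lo, limit) + utf8_sequences(limit + 1, hi)
    if hi <= 0x7F:
        return [[(lo, hi)]]
    # Split until every continuation byte either spans 0x80..0xBF in full
    # or sits below lead bytes that are identical at both ends.
    for i in range(1, 4):
        mask = (1 << (6 * i)) - 1
        if (lo & ~mask) != (hi & ~mask):
            if lo & mask != 0:
                return (utf8_sequences(lo, lo | mask)
                        + utf8_sequences((lo | mask) + 1, hi))
            if hi & mask != mask:
                return (utf8_sequences(lo, (hi & ~mask) - 1)
                        + utf8_sequences(hi & ~mask, hi))
    first = chr(lo).encode("utf-8")
    last = chr(hi).encode("utf-8")
    return [list(zip(first, last))]


def matches(sequences, data):
    """True if the bytes object data fits one of the byte-range sequences."""
    return any(
        len(seq) == len(data)
        and all(b_lo <= b <= b_hi for b, (b_lo, b_hi) in zip(data, seq))
        for seq in sequences
    )


# Greek and Coptic block, U+0370..U+03FF:
# [[(0xCD, 0xCD), (0xB0, 0xBF)], [(0xCE, 0xCF), (0x80, 0xBF)]]
print(utf8_sequences(0x0370, 0x03FF))
```

Each alternative is just a product of ordinary byte sets, which is the
shape a character-set optimizer already handles for single octets.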
> For instance, would it be worth supporting something like properties
> (using wctype)?
I didn't suggest this because I don't think standard APIs give you
convenient access to a unicode character properties table, which would
make it difficult to compile character property matches efficiently.
Perhaps lpeg could come with a separate sub-module containing patterns for
Unicode properties, separated out for size reasons. PCRE's ucptable.c is
over 440 KiB of source and compiles to 88.5 KiB of data.
> Or a capture that matches one code point and captures its value?
Could be useful, yes.
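For illustration, this is roughly what such a capture would have to
compute. The sketch is in Python rather than LPeg, and match_code_point is
a hypothetical name: it matches one well-formed UTF-8 sequence at a given
position and returns the code point value plus the next position, or None
when there is no valid sequence there.

```python
def match_code_point(data, pos=0):
    """Return (code_point, next_pos) for the UTF-8 sequence at data[pos].

    data is a bytes object. Returns None if no well-formed sequence
    starts at pos.
    """
    if pos >= len(data):
        return None
    b0 = data[pos]
    if b0 < 0x80:
        return b0, pos + 1
    # Lead byte decides the length; 'lowest' rejects overlong forms.
    if 0xC2 <= b0 <= 0xDF:
        need, cp, lowest = 1, b0 & 0x1F, 0x80
    elif 0xE0 <= b0 <= 0xEF:
        need, cp, lowest = 2, b0 & 0x0F, 0x800
    elif 0xF0 <= b0 <= 0xF4:
        need, cp, lowest = 3, b0 & 0x07, 0x10000
    else:
        return None
    tail = data[pos + 1 : pos + 1 + need]
    if len(tail) != need or any(b & 0xC0 != 0x80 for b in tail):
        return None
    for b in tail:
        cp = (cp << 6) | (b & 0x3F)
    # Reject overlong encodings, surrogates and values past U+10FFFF.
    if cp < lowest or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        return None
    return cp, pos + 1 + need


print(match_code_point(b"\xe2\x82\xac!"))  # (8364, 3), i.e. U+20AC
```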
Brief rant follows: One thing I really hate about the standards for UTF-8
is that they require applications to break (e.g. lose data or abort) when
they encounter an invalid code point. This is very very bad. It would be
much better if there were some code points reserved for invalid octets in
UTF-8 data, so that if an app were mistakenly given a binary data stream
or some slightly corrupt data or some ISO 8859 text, it could at least
pass it through unscathed. Bah.
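As a point of comparison, Python's "surrogateescape" error handler works
roughly along these lines: each undecodable octet is mapped to a lone
surrogate in U+DC80..U+DCFF and turned back into the original octet on
encoding. This is a convention of Python's codecs, not something the UTF-8
standards reserve, and the helper names below are hypothetical.

```python
def decode_lossless(raw):
    """Decode bytes as UTF-8, keeping invalid octets as lone surrogates."""
    return raw.decode("utf-8", errors="surrogateescape")


def encode_lossless(text):
    """Inverse of decode_lossless: restore the original octets."""
    return text.encode("utf-8", errors="surrogateescape")


raw = b"caf\xe9 \xff\xfe ok"  # ISO 8859-1 text plus stray octets
text = decode_lossless(raw)
assert encode_lossless(text) == raw  # passes through unscathed
```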
Tony.
--
f.anthony.n.finch <dot@dotat.at> http://dotat.at/
South-east Iceland: Variable 4, becoming easterly or southeasterly, then
cyclonic 6 to gale 8, occasionally severe gale 9 later. Very rough or high.
Rain or squally showers. Moderate or good.