- Subject: Re: Of Unicode in the next Lua version
- From: Pierre-Yves Gérardy <pygy79@...>
- Date: Fri, 14 Jun 2013 00:29:29 +0200
On Thu, Jun 13, 2013 at 11:36 PM, Jay Carlson <nop@nop.com> wrote:
> On Jun 12, 2013, at 9:53 AM, Pierre-Yves Gérardy wrote:
> Don't forget getchar(S, 2) -> error("not defined at position 2"). I really like Julia's idea of strings as partial functions.
I'd prefer getchar(S, 2) --> false, 3.
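A minimal sketch of that return convention, assuming S is UTF-8 and positions are byte offsets. The name getchar comes from the thread; the behaviour here (false plus the next position where a character can start) is only one reading of the proposal, not an existing API:

```lua
-- Hypothetical getchar(s, i): returns the UTF-8 character that starts at
-- byte position i and the position of the next character, or false and the
-- next position at which a character could start.
local function getchar(s, i)
  local b = s:byte(i)
  if not b then return nil end              -- i is past the end of s
  local len
  if b < 0x80 then len = 1                  -- ASCII
  elseif b >= 0xC2 and b <= 0xDF then len = 2
  elseif b >= 0xE0 and b <= 0xEF then len = 3
  elseif b >= 0xF0 and b <= 0xF4 then len = 4
  end
  if len and i + len - 1 <= #s then
    -- every trailing byte must be a continuation byte (0x80..0xBF)
    local ok = true
    for j = i + 1, i + len - 1 do
      local c = s:byte(j)
      if c < 0x80 or c > 0xBF then
        ok = false
        break
      end
    end
    if ok then return s:sub(i, i + len - 1), i + len end
  end
  -- not a character start (or malformed): skip any continuation bytes
  local j = i + 1
  while j <= #s do
    local c = s:byte(j)
    if c < 0x80 or c > 0xBF then break end
    j = j + 1
  end
  return false, j
end
```

With S = "\195\169a" (an "é" followed by "a"), getchar(S, 1) gives the two-byte character and 3, while getchar(S, 2) lands inside the character and gives false, 3. The sketch only checks lead and continuation byte ranges; it does not reject every ill-formed sequence (overlong three-byte forms, for instance).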
>> A similar function could return code points instead of strings.
>
> Would you use that much?
Yes, before I broke Unicode support in LuLPeg, that's what I was
using. It lets you check whether a character is in a given range, and it
is barely slower than returning a sub-string (doing the conversion in
Lua). In LuaJIT, computing the code point with standard arithmetic
(mod, division and floor) is faster than getting the sub-string. It
should be even faster with the bit library.
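For illustration, here is a rough sketch of decoding a code point with plain arithmetic. It is not the LuLPeg code; the names codepoint and in_range are made up for this example, it assumes well-formed UTF-8, and it assumes i points at the first byte of a character. Decoding needs only mod and multiplication; division and floor come in when going the other way.

```lua
-- Hypothetical helper: decode the UTF-8 sequence starting at byte i.
-- Returns the code point and the position of the next character.
-- Assumes well-formed input; no validation is done.
local function codepoint(s, i)
  i = i or 1
  local b1 = s:byte(i)
  if b1 < 0x80 then                         -- 1 byte: 0xxxxxxx
    return b1, i + 1
  elseif b1 < 0xE0 then                     -- 2 bytes: 110xxxxx 10xxxxxx
    local b2 = s:byte(i + 1)
    return (b1 % 0x20) * 0x40 + b2 % 0x40, i + 2
  elseif b1 < 0xF0 then                     -- 3 bytes: 1110xxxx + 2 trailing
    local b2, b3 = s:byte(i + 1, i + 2)
    return (b1 % 0x10) * 0x1000 + (b2 % 0x40) * 0x40 + b3 % 0x40, i + 3
  else                                      -- 4 bytes: 11110xxx + 3 trailing
    local b2, b3, b4 = s:byte(i + 1, i + 3)
    return (b1 % 0x08) * 0x40000 + (b2 % 0x40) * 0x1000
         + (b3 % 0x40) * 0x40 + b4 % 0x40, i + 4
  end
end

-- Range check on code points, the use case mentioned above.
local function in_range(cp, lo, hi)
  return cp >= lo and cp <= hi
end
```

For example, in_range(codepoint("\195\169"), 0x80, 0xFF) tests whether "é" falls in the Latin-1 Supplement block without ever building a sub-string.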
> Miles Bader pointed out a lot of string iteration code is phrased in terms of gmatch--or should be. And in that case, there are no string positions at all.
Well, in my case, it isn't, but an LPeg clone is probably not typical
string-processing code.
> The major problem for UTF-8 then would be convincing the pattern matcher to consume an entire UTF-8 sequence for ".".
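One workaround that needs no change to the matcher is to spell out a character class that consumes a whole UTF-8 sequence and use it where "." would go. A minimal sketch, assuming well-formed input; UTF8_CHAR and chars are illustrative names, and the NUL byte is left out of the class to sidestep escape differences between Lua versions:

```lua
-- A lead byte (ASCII or 0xC2..0xF4) followed by any continuation bytes.
-- This stands in for "." when iterating over UTF-8 text with gmatch.
local UTF8_CHAR = "[\1-\127\194-\244][\128-\191]*"

-- Iterate over the characters of s, one UTF-8 sequence at a time.
local function chars(s)
  return s:gmatch(UTF8_CHAR)
end

-- Example: "hé€" is three characters but six bytes.
local out = {}
for ch in chars("h\195\169\226\130\172") do
  out[#out + 1] = ch
end
```

This does not validate anything: stray continuation bytes are silently skipped, and a lead byte swallows however many continuation bytes follow it.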
In the 2012 Workshop presentation, Roberto talks about deprecating the
old patterns, so Unicode in gmatch will probably never see the light
of day... I don't know if/how he plans to handle Unicode in LPeg.
As posted in the other thread, I plan to tackle this in LuLPeg with
P8(), R8() and S8(), which will live alongside their byte-matching
cousins.
-- Pierre-Yves