- Subject: Re: Of Unicode in the next Lua version
- From: Roberto Ierusalimschy <roberto@...>
- Date: Sat, 15 Jun 2013 10:52:01 -0300
> It is efficient, and often practical, to deal with byte indices, even
> in Unicode strings. It is the approach taken by Julia, and I use it in
> LuLPeg. The API is simple:
>
> char, next_pos = getchar(subject, position)
>
> S = "∂ƒ"
> getchar(S, 1) --> '∂', 4
> getchar(S, 4) --> 'ƒ', 6
> getchar(S, 6) --> nil, nil
>
> A similar function could return code points instead of strings.
>
> What do you think about this?
You can already easily implement this 'getchar' in standard Lua (except
that it assumes a well-formed string):

  S = "∂ƒ"
  print(string.match(S, "([^\x80-\xbf][\x80-\xbf]*)()", 1))  --> '∂', 4
  print(string.match(S, "([^\x80-\xbf][\x80-\xbf]*)()", 4))  --> 'ƒ', 6
  print(string.match(S, "([^\x80-\xbf][\x80-\xbf]*)()", 6))  --> nil

PiL3 discusses other patterns like this one.
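
As a minimal sketch, the pattern above can be wrapped in a function with
the 'getchar' interface from the quoted message. It assumes Lua 5.2 or
later (for the "\x" escapes), a source file saved as UTF-8, and a
well-formed subject string. The names ONECHAR and getchar are only
illustrative.

```lua
-- Sketch only: wraps the pattern from the message in a function.
-- Assumes Lua 5.2+ and a well-formed UTF-8 subject string.
local ONECHAR = "([^\x80-\xbf][\x80-\xbf]*)()"

-- Returns the first character found at or after byte index `pos`,
-- plus the byte index just past it; returns nil if there is none.
local function getchar(subject, pos)
  return string.match(subject, ONECHAR, pos)
end

-- Walk a string one character at a time using byte indices.
local S = "∂ƒ"
local pos = 1
while true do
  local char, next_pos = getchar(S, pos)
  if not char then break end
  print(char, next_pos)
  pos = next_pos
end
```

Because the pattern is not anchored, a position that falls inside a
multi-byte sequence skips forward to the start of the next character
instead of failing.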
Of course, the pattern "([^\x80-\xbf][\x80-\xbf]*)()" is not everyone's
cup of tea, but it does not sound very Lua-ish to "duplicate" this
functionality in a new function. Maybe we could have some patterns like
this one predefined in the library, like this:

  string.match(S, utf8.onechar, 1)

(The new library certainly would have a function to check whether a
utf-8 string is well formed.)
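
To make the idea concrete, here is a rough sketch of what such a table
could look like in plain Lua. Everything in it is hypothetical: the
table name 'myutf8', the field 'onechar', and the function 'wellformed'
are made up for illustration and are not an existing library. It assumes
Lua 5.2 or later.

```lua
-- Hypothetical sketch, not an actual library.
local myutf8 = {}

-- Predefined pattern: one non-continuation byte followed by any number
-- of continuation bytes, then a position capture.
myutf8.onechar = "([^\x80-\xbf][\x80-\xbf]*)()"

-- Returns true if `s` is well-formed UTF-8: valid lead bytes, the right
-- number of continuation bytes, no overlong forms, no surrogates, and
-- no code point above U+10FFFF.
function myutf8.wellformed(s)
  local i, n = 1, #s
  while i <= n do
    local b = s:byte(i)
    local len, min, cp
    if b < 0x80 then len, min, cp = 1, 0, b
    elseif b >= 0xC2 and b <= 0xDF then len, min, cp = 2, 0x80, b - 0xC0
    elseif b >= 0xE0 and b <= 0xEF then len, min, cp = 3, 0x800, b - 0xE0
    elseif b >= 0xF0 and b <= 0xF4 then len, min, cp = 4, 0x10000, b - 0xF0
    else return false end                      -- stray or invalid lead byte
    if i + len - 1 > n then return false end   -- truncated sequence
    for j = i + 1, i + len - 1 do
      local c = s:byte(j)
      if c < 0x80 or c > 0xBF then return false end
      cp = cp * 64 + (c - 0x80)
    end
    if cp < min or cp > 0x10FFFF then return false end
    if cp >= 0xD800 and cp <= 0xDFFF then return false end
    i = i + len
  end
  return true
end

local S = "∂ƒ"
if myutf8.wellformed(S) then
  print(string.match(S, myutf8.onechar, 1))
end
```

Checking well-formedness first covers the assumption that the pattern
itself makes about its input.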
-- Roberto