- Subject: Re: Of Unicode in the next Lua version
- From: Pierre-Yves Gérardy <pygy79@...>
- Date: Sat, 15 Jun 2013 20:13:31 +0200
On Sat, Jun 15, 2013 at 3:52 PM, Roberto Ierusalimschy
<roberto@inf.puc-rio.br> wrote:
>
> You can already easily implement this 'getchar' in standard Lua (except
> that it assumes a well-formed string):
>
> S = "∂ƒ"
> print(string.match(S, "([^\x80-\xbf][\x80-\xbf]*)()", 1)) --> '∂', 4
> print(string.match(S, "([^\x80-\xbf][\x80-\xbf]*)()", 4)) --> 'ƒ', 6
> print(string.match(S, "([^\x80-\xbf][\x80-\xbf]*)()", 6)) --> nil
Thanks for this pattern trick, it helped me to improve my
`getcodepoint()` routine (although I eventually found a faster
method). A validation routine won't be of much help
if you deal with a document that sports multiple encodings.
If you want to validate the characters on the go and always get a position as
the second return value, you need something like the function below. If the
character is not valid, it returns `false, position`. At the end of the
stream, it returns `nil` along with the position it was given (one past the
last byte):
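local s_byte, s_match, s_sub = string.byte, string.match, string.sub

function getchar(S, first)
  if #S < first then
    return nil, first
  end
  local match, next_pos = S:match("^([^\128-\191][\128-\191]*)()", first)
  if not match then
    -- `first` points at a stray continuation byte.
    return false, first
  end
  -- Named `lead` so that it does not shadow `first`, which is the
  -- position returned on failure.
  local lead, n = s_byte(match), #match
  -- The lead byte fixes the sequence length. Thresholds are decimal, and
  -- each range needs its lower bound so that, e.g., an ASCII byte followed
  -- by a stray continuation byte is rejected.
  local success
     =  lead < 128 and n == 1
     or lead >= 192 and lead < 224 and n == 2
     or lead >= 224 and lead < 240 and n == 3
     or lead >= 240 and lead < 248 and n == 4
     or lead >= 248 and lead < 252 and n == 5
     or lead >= 252 and lead < 254 and n == 6
  -- UTF-16 surrogate code point checking left out for clarity.
  if success then
    return match, next_pos
  else
    return false, first
  end
end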
Or this (same speed in Lua 5.1/5.2, but twice as fast in LuaJIT, where
`gmatch()` is not compiled):
function utf8_get_char_jit_valid2(subject, i)
  if i > #subject then
    return nil, i
  end
  local byte = s_byte(subject, i)
  if byte < 128 then
    return s_sub(subject, i, i), i + 1
  elseif byte < 192 then
    -- Stray continuation byte.
    return false, i
  elseif byte < 224 and s_match(subject, "^[\128-\191]", i + 1) then
    return s_sub(subject, i, i + 1), i + 2
  elseif byte < 240 and s_match(subject,
      "^[\128-\191][\128-\191]", i + 1) then
    return s_sub(subject, i, i + 2), i + 3
  elseif byte < 248 and s_match(subject,
      "^[\128-\191][\128-\191][\128-\191]", i + 1) then
    return s_sub(subject, i, i + 3), i + 4
  elseif byte < 252 and s_match(subject,
      "^[\128-\191][\128-\191][\128-\191][\128-\191]", i + 1) then
    return s_sub(subject, i, i + 4), i + 5
  elseif byte < 254 and s_match(subject,
      "^[\128-\191][\128-\191][\128-\191][\128-\191][\128-\191]",
      i + 1) then
    return s_sub(subject, i, i + 5), i + 6
  else
    return false, i
  end
end
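A function with this contract (character or `false`, plus a position) can
drive a scanning loop directly. Here is a minimal sketch of such a loop;
`scan()` and `expected_len()` are hypothetical names, and the compact
`getchar()` below is only a stand-in so that the snippet runs on its own.
Like the code above, it does not check surrogates or overlong forms.

```lua
-- Hypothetical helper: sequence length implied by a lead byte, or nil
-- for a continuation byte (128-191) or an invalid lead byte (254, 255).
local function expected_len(byte)
  if byte < 128 then return 1
  elseif byte < 192 then return nil
  elseif byte < 224 then return 2
  elseif byte < 240 then return 3
  elseif byte < 248 then return 4
  elseif byte < 252 then return 5
  elseif byte < 254 then return 6
  end
  return nil
end

-- Stand-in with the contract described in this message:
--   char, next_position   for a well-formed sequence
--   false, position       for an invalid one
--   nil, position         past the end of the string
local function getchar(S, pos)
  if pos > #S then return nil, pos end
  local len = expected_len(S:byte(pos))
  if not len then return false, pos end
  -- Every remaining byte of the sequence must be a continuation byte.
  for i = pos + 1, pos + len - 1 do
    local b = S:byte(i)
    if not b or b < 128 or b > 191 then return false, pos end
  end
  return S:sub(pos, pos + len - 1), pos + len
end

-- Hypothetical scanning loop: collect the valid characters and record
-- the positions of the bytes that had to be skipped.
local function scan(S)
  local chars, bad = {}, {}
  local pos = 1
  while true do
    local char, next_pos = getchar(S, pos)
    if char == nil then break end
    if char then
      chars[#chars + 1] = char
      pos = next_pos
    else
      bad[#bad + 1] = pos
      pos = pos + 1  -- resynchronize one byte further
    end
  end
  return chars, bad
end

-- "a", "∂" (3 bytes), a stray continuation byte, "b"
local chars, bad = scan("a\226\136\130\128b")
print(#chars, #bad, bad[1]) --> 3, 1, 5
```

Skipping a single byte after a failure is just one possible recovery
policy; a parser may prefer to stop at the first `false`.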
This is not that complex, but still rather slow in Lua, and the same
goes for getting the code point to perform a range query (useful to
test if a code point is part of some alphabet).
To that end, you could provide a `utf8.range(char, lower, upper)`, though.
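As a rough sketch of what such a range query involves in plain Lua: decode
the code point with arithmetic only (no bit library, so it should run on
5.1 and 5.2 alike), then compare. `codepoint()` and `in_range()` are
hypothetical names, not an existing or proposed API, and the input is
assumed to be a single well-formed character such as the ones returned
by `getchar()`.

```lua
-- Value to subtract from the lead byte to strip its length marker bits,
-- indexed by sequence length.
local markers = { [2] = 192, [3] = 224, [4] = 240, [5] = 248, [6] = 252 }

-- Hypothetical helper: code point of a single well-formed UTF-8 character.
-- No validation is done here; the caller is assumed to have checked `char`.
local function codepoint(char)
  local lead, n = char:byte(1), #char
  if n == 1 then return lead end
  local cp = lead - markers[n]
  -- Each continuation byte contributes its low 6 bits.
  for i = 2, n do
    cp = cp * 64 + (char:byte(i) - 128)
  end
  return cp
end

-- Hypothetical range test: is the code point of `char` in [lower, upper]?
local function in_range(char, lower, upper)
  local cp = codepoint(char)
  return lower <= cp and cp <= upper
end

print(codepoint("\226\136\130"))            --> 8706 (U+2202, '∂')
print(in_range("\198\146", 0x0180, 0x024F)) --> true (U+0192, 'ƒ')
```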
This assumes you don't deprecate patterns in the next Lua version (or
the one after, to ease the transition?).
But I understand the need to balance features and light weight.
`getchar()` and `getcodepoint()` are damn useful to write parsers, but
if LPeg is part of the next version, the point is probably moot.
-- Pierre-Yves