[Date Prev][Date Next][Thread Prev][Thread Next]
[Date Index]
[Thread Index]
- Subject: Re: UTF-8 validation
- From: Jay Carlson <nop@...>
- Date: Thu, 10 Dec 2015 14:40:38 -0500
On 2015-12-09, at 9:32 PM, Jonathan Goble <jcgoble3@gmail.com> wrote:
>
> On Wed, Dec 9, 2015 at 9:29 PM, Jay Carlson <nop@nop.com> wrote:
>> Given a string where is_utf8(s) is false, it might be nice to be able to find the byte offset of the first non-UTF-8 sequence.
>
> utf8.len() already does this. On an invalid sequence, it returns two
> values: nil plus the byte position of the first invalid sequence. I
> believe this was also mentioned earlier in the thread.
utf8.len does more than is required for a "is_utf8" function, so it may be inefficient to use utf8.len as a predicate.
Consider a simpler predicate:
function has_no_uppercase(s)
return not string.find(s, "[A-Z]")
end
This predicate can be memoized:
lcased = setmetatable({}, {__mode="kv"})
function memoized_has_no_uppercase(s)
local r = lcased[s]
if r ~= nil then return r end
r = has_no_uppercase(s)
lcased[s] = r
return r
end
print( memoized_has_no_uppercase("abcd") ) -- true
print( memoized_has_no_uppercase("abCd") ) -- false
If we happen to have memoized the state of two strings we're concatenating, we also know the state of the result:
function concat_s(s1, s2)
local s = s1..s2
local r1 = lcased[s1]
local r2 = lcased[s2]
if r1 ~= nil and r2 ~= nil then
lcased[s] = r1 and r2
end
return s
end
The following already know their status:
cs0 = concat_s("abcd", "abCd")
cs1 = concat_s("abcd", "abcd")
print( memoized_has_no_uppercase(cs0) ) -- false
print( memoized_has_no_uppercase(cs1) ) -- true
> for k,v in pairs(lcased) do print(k,v) end
abcdabCd false
abcd true
abcdabcd true
abCd false
But the following requires scanning, because the second string has not been seen before:
cs3 = concat_s("abcd", "wxyz")
print( memoized_has_no_uppercase(cs3) )
It is then a more expensive operation to ask "where is the first uppercase character?" than "are there no uppercase characters?" because composite operations closed over lowercase-ness can keep track of this information via the three-valued logic.
The related string category "has_no_high_bits" aka "is_ascii" is interesting because any member of "is_ascii" is also a member of "is_utf8". What's more, it's almost free to calculate "has_no_high_bits" on any hashed string.
The C implementation I've been playing with keeps information on character set membership directly on the Lua TString, so there is no table manipulation. It patches luaV_concat to sum the membership info over the strings it's summing.
Jay
(Please ignore the interaction of strings and weak tables.)