Re: UTF-8 validation

lua-l archive

[Date Prev][Date Next][Thread Prev][Thread Next] [Date Index] [Thread Index]

Subject: Re: UTF-8 validation
From: Jay Carlson <nop@...>
Date: Thu, 10 Dec 2015 14:40:38 -0500

On 2015-12-09, at 9:32 PM, Jonathan Goble <jcgoble3@gmail.com> wrote:
> 
> On Wed, Dec 9, 2015 at 9:29 PM, Jay Carlson <nop@nop.com> wrote:
>> Given a string where is_utf8(s) is false, it might be nice to be able to find the byte offset of the first non-UTF-8 sequence.
> 
> utf8.len() already does this. On an invalid sequence, it returns two
> values: nil plus the byte position of the first invalid sequence. I
> believe this was also mentioned earlier in the thread.

utf8.len does more than is required for a "is_utf8" function, so it may be inefficient to use utf8.len as a predicate. 

Consider a simpler predicate:

function has_no_uppercase(s) 
   return not string.find(s, "[A-Z]") 
end

This predicate can be memoized:

lcased = setmetatable({}, {__mode="kv"})

function memoized_has_no_uppercase(s)
  local r = lcased[s]
  if r ~= nil then return r end
  r = has_no_uppercase(s)
  lcased[s] = r
  return r
end

print( memoized_has_no_uppercase("abcd") ) -- true
print( memoized_has_no_uppercase("abCd") ) -- false

If we happen to have memoized the state of two strings we're concatenating, we also know the state of the result:

function concat_s(s1, s2)
  local s = s1..s2
  local r1 = lcased[s1]
  local r2 = lcased[s2]
  if r1 ~= nil and r2 ~= nil then
    lcased[s] = r1 and r2
  end
  return s
end

The following already know their status:

cs0 = concat_s("abcd", "abCd")
cs1 = concat_s("abcd", "abcd")
print( memoized_has_no_uppercase(cs0) ) -- false
print( memoized_has_no_uppercase(cs1) ) -- true

> for k,v in pairs(lcased) do print(k,v) end
abcdabCd    false
abcd        true
abcdabcd    true
abCd        false

But the following requires scanning, because the second string has not been seen before:

cs3 = concat_s("abcd", "wxyz")
print( memoized_has_no_uppercase(cs3) )

It is then a more expensive operation to ask "where is the first uppercase character?" than "are there no uppercase characters?" because composite operations closed over lowercase-ness can keep track of this information via the three-valued logic.

The related string category "has_no_high_bits" aka "is_ascii" is interesting because any member of "is_ascii" is also a member of "is_utf8". What's more, it's almost free to calculate "has_no_high_bits" on any hashed string.

The C implementation I've been playing with keeps information on character set membership directly on the Lua TString, so there is no table manipulation. It patches luaV_concat to sum the membership info over the strings it's summing.

Jay

(Please ignore the interaction of strings and weak tables.)

References:
- UTF-8 validation, Cezary H. Noweta
- Re: UTF-8 validation, Coda Highland
- Re: UTF-8 validation, Cezary H. Noweta
- Re: UTF-8 validation, Coda Highland
- Re: UTF-8 validation, Cezary H. Noweta
- Re: UTF-8 validation, Javier Guerra Giraldez
- Re: UTF-8 validation, Coda Highland
- Re: UTF-8 validation, Jay Carlson
- Re: UTF-8 validation, Jonathan Goble

Prev by Date: Re: UTF-8 validation
Next by Date: Re: About official Lua's products on Zazzle and CafePress
Previous by thread: Re: UTF-8 validation
Next by thread: Microsoft flags mails from this list as spam
Index(es):
- Date
- Thread