- Subject: Re: Of Unicode in the next Lua version
- From: Jay Carlson <nop@...>
- Date: Sat, 15 Jun 2013 16:08:58 -0400
On Jun 15, 2013, at 2:13 PM, Pierre-Yves Gérardy wrote:
> On Sat, Jun 15, 2013 at 3:52 PM, Roberto Ierusalimschy
> <roberto@inf.puc-rio.br> wrote:
>>
>> You can already easily implement this `getchar' in standard Lua (except
>> that it assumes a well-formed string):
>>
>> S = "∂ƒ"
>> print(string.match(S, "([^\x80-\xbf][\x80-\xbf]*)()", 1)) --> '∂', 4
>> print(string.match(S, "([^\x80-\xbf][\x80-\xbf]*)()", 4)) --> 'ƒ', 6
>> print(string.match(S, "([^\x80-\xbf][\x80-\xbf]*)()", 6)) --> nil
>
> Thanks for this pattern trick; it helped me improve my
> `getcodepoint()` routine (although I eventually found a faster
> method). A validation routine won't be of much help
> if you deal with a document that sports multiple encodings.
I tend to think, "strings are components of documents"; when using the XML Infoset as a model, encodings can be abstracted away. But concrete strings are different. In general, people don't have individual strings with multiple encodings; people have bugs.[1]
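As an aside, Roberto's pattern drops straight into an iterator. A minimal sketch, with the same well-formedness assumption as the original:

local function utf8_chars(s)
    local i = 1
    return function()
        local c, nexti = string.match(s, "([^\x80-\xbf][\x80-\xbf]*)()", i)
        if c then i = nexti; return c end
    end
end

for c in utf8_chars("∂ƒ") do print(c) end  --> ∂  ƒ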
> If you want to validate the characters on the go and always get a position as
> second argument, you need something like this:
>
> If the character is not valid, it returns `false, position`. At the
> end of the stream, it returns `nil, position + 1`.
I don't understand where "false" instead of an error would be useful. Once you've decided to iterate over a string as UTF-8, it is a surprise when the string turns out not to be UTF-8, and it's unlikely your code will do anything useful. There could be a separate utf8.isvalid(s, [byteoffset [, bytelen]]) for when you're testing.
I am one of those "assert everything" fascists, though. Code encountering "false" in place of an expected string often blows up anyway (although convenience functions which auto-coerce to string can hide that). The question is how promptly the error is signaled.
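Such an isvalid might look like this in plain Lua (a sketch only; surrogate and overlong-encoding checks are omitted, like in the quoted code below):

local function isvalid(s, byteoffset, bytelen)
    local i = byteoffset or 1
    local last = bytelen and math.min(i + bytelen - 1, #s) or #s
    while i <= last do
        local b, n = s:byte(i)
        if b < 0x80 then n = 0                       -- ASCII
        elseif b >= 0xc2 and b <= 0xdf then n = 1    -- 2-byte sequence
        elseif b >= 0xe0 and b <= 0xef then n = 2    -- 3-byte sequence
        elseif b >= 0xf0 and b <= 0xf4 then n = 3    -- 4-byte sequence
        else return false end                        -- bad lead byte
        for j = i + 1, i + n do
            local c = j <= last and s:byte(j)
            if not (c and c >= 0x80 and c <= 0xbf) then return false end
        end
        i = i + n + 1
    end
    return true
end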
> --UTF-16 surrogate code point checking left out for clarity.
...plus the stuff over U+10FFFF...
> This is not that complex, but still rather slow in Lua, and the same
> goes for getting the code point to perform a range query (useful to
> test if a code point is part of some alphabet).
>
> To that end, you could provide a `utf8.range(char, lower, upper)`, though.
UTF-8 is constructed such that Unicode code points are ordered lexicographically under 8-bit strcmp. So you can replace that with
-- each argument is a string holding the UTF-8 encoding of one code point
function utf8.inrange(single_codepoint, lower_codepoint, upper_codepoint)
    return single_codepoint >= lower_codepoint
       and single_codepoint <= upper_codepoint
end
and you don't need to extract the codepoint from a longer string if you write `< upper_codepoint_plus_one` instead: since no UTF-8 sequence is a prefix of another, you can compare the whole tail of the string directly, which lets you test an arbitrary byte offset for range membership. All of these nice properties go to hell if invalid UTF-8 creeps in, though.
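A sketch of that byte-offset test (the names are mine, and it assumes valid UTF-8 throughout):

-- lower and upper_plus_one are the UTF-8 encodings of the range bounds;
-- the tail comparison never has to decode a code point.
local function inrange_at(s, i, lower, upper_plus_one)
    local tail = s:sub(i)
    return tail >= lower and tail < upper_plus_one
end

-- e.g., is the code point at byte offset i in the Greek block U+0370..U+03FF?
-- inrange_at(s, i, "\xcd\xb0", "\xd0\x80")  -- "\xd0\x80" encodes U+0400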
Languages like Lua tend to be very slow when operating character by character. I think some kind of map/collect primitive for working with code points is needed, and it probably has to be in C for speed. Because so many functions on Unicode code points are sparse, something like map_table[page][offset] is useful, especially if those tables have metatables which can answer with a function and optionally populate the pages lazily.
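A sketch of that lazily populated two-level table (the page builder `compute_page` is hypothetical):

local map_table = setmetatable({}, {
    __index = function(t, page)
        local entries = compute_page(page)  -- hypothetical: builds the 256-entry table for one page
        rawset(t, page, entries)            -- cache it so __index fires once per page
        return entries
    end,
})

-- Lookup for code point cp:
-- local prop = map_table[math.floor(cp / 256)][cp % 256]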
Jay
[1]: If somebody hands you a URL, you can't round-trip the %-encoding through Unicode; it must be preserved as US ASCII. Casual URL manipulation is full of string-typing bugs. I wrote a web server which used %23 in URLs. Broken web proxies would unescape the string, notice that there was now a "#foo", and truncate the URL at the mistaken fragment identifier.