- Subject: Re: Of Unicode in the next Lua version
- From: Jay Carlson <nop@...>
- Date: Sat, 15 Jun 2013 20:13:54 -0400
On Jun 15, 2013, at 6:56 PM, Pierre-Yves Gérardy wrote:
> On Sat, Jun 15, 2013 at 10:08 PM, Jay Carlson <nop@nop.com> wrote:
>> UTF-8 is constructed so that a byte-wise (unsigned 8-bit) strcmp orders encoded strings by Unicode codepoint. So you can replace that with
>>
>> -- each argument is a string holding a single UTF-8 encoded codepoint
>> function utf8.inrange(single_codepoint, lower_codepoint, upper_codepoint)
>>   return single_codepoint >= lower_codepoint and single_codepoint <= upper_codepoint
>> end
>
> I hadn't realized this. I'm accreting knowledge on the go, I've yet to
> rigorously explore Unicode... I find UTF-8 beautiful in lots of
> regards. UTF-16 baffles me, though. Do you know why they reserved
> codepoints, which are supposed to correspond to symbols, to the
> implementation details of an encoding? I wish there was a UTF-16'
> that followed the UTF-8 strategy.
Originally, Unicode was sold as "double-wide ASCII": "All you have to do to support the world's scripts is use 2-byte characters." Then they decided 64k codepoints was *not* enough for everyone. Before they ran out of space, they allocated the surrogate blocks so that existing software built on 2-byte characters could reach the other planes. It's a good design given the constraints.
UCS-2 was attractive; all codepoints were a fixed size. Adding UTF-16 to support the astral planes meant codepoints were *not* all the same size, and once you had to deal with that, other variable-width codes like UTF-8 were more competitive.
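To make the surrogate trick concrete, here is a rough Lua sketch of how one codepoint beyond U+FFFF is split into two 16-bit units taken from the reserved blocks. The function name `to_surrogates` is made up for illustration; it is not part of any existing or proposed library.

```lua
-- Hypothetical helper: split a codepoint above U+FFFF into a UTF-16
-- surrogate pair (high unit first, low unit second).
local function to_surrogates(cp)
  assert(cp >= 0x10000 and cp <= 0x10FFFF, "only astral codepoints need a pair")
  local v = cp - 0x10000                        -- 20 bits are left to encode
  local high = 0xD800 + math.floor(v / 0x400)   -- top 10 bits
  local low  = 0xDC00 + v % 0x400               -- bottom 10 bits
  return high, low
end

local high, low = to_surrogates(0x1F600)
print(string.format("U+1F600 -> %04X %04X", high, low))  -- D83D DE00
```

The high units always fall in D800..DBFF and the low units in DC00..DFFF, so a surrogate can never be mistaken for an ordinary 2-byte character. That is what the reserved codepoints buy.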
>> and you don't need to extract the codepoint from a longer string if you write "< upper_codepoint_plus_one"; this lets you test an arbitrary byte offset for range membership.
>
> I don't understand what you mean here :-/
I'm sorry, that really was cryptic. Let's look at an ASCII range tester: is the first character of this string a capital letter?
first_member = "A"
last_member = "Z"
after_last = string.char(string.byte(last_member)+1)
-- outside the range
assert(not( "" >= first_member ))
assert(not( "@home" >= first_member ))
-- inside the range
assert( "Alphabet" >= first_member )
assert( "Yow" < last_member )
assert( "Zymurgy" < after_last )
-- outside
assert(not( "[" < after_last ))
Any string inside the range must be strictly less than the very first value after the range.
This same property is true of UTF-8 strings. Any string starting with a codepoint inside the range will be strictly less than the string consisting of the first codepoint outside the range. You can test membership of the codepoint at any byte-index without extracting it. You could use C code like this:
strcmp(s+7, first_member) >= 0 && strcmp(s+7, after_last) < 0
as long as you know s+7 is a valid offset inside the string. (This mostly works for invalid offsets too.)
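For what it's worth, here is a rough Lua sketch of the same test. The names `encode` and `in_range_at` are made up for illustration, and the byte index is assumed to point at the first byte of a codepoint. It also assumes Lua compares strings byte-wise, which holds in the default "C" locale (Lua's string comparison follows the current locale). Unlike the C pointer version, `s:sub(i)` copies the tail of the string.

```lua
-- Hypothetical helper: encode one codepoint (0..0x10FFFF) as UTF-8.
local function encode(cp)
  local floor, char = math.floor, string.char
  if cp < 0x80 then
    return char(cp)
  elseif cp < 0x800 then
    return char(0xC0 + floor(cp / 0x40), 0x80 + cp % 0x40)
  elseif cp < 0x10000 then
    return char(0xE0 + floor(cp / 0x1000),
                0x80 + floor(cp / 0x40) % 0x40,
                0x80 + cp % 0x40)
  else
    return char(0xF0 + floor(cp / 0x40000),
                0x80 + floor(cp / 0x1000) % 0x40,
                0x80 + floor(cp / 0x40) % 0x40,
                0x80 + cp % 0x40)
  end
end

-- Hypothetical helper: is the codepoint starting at byte index i of s
-- inside [first_cp, last_cp]? Nothing is decoded; the tail of the string
-- is compared against the encoded bounds.
local function in_range_at(s, i, first_cp, last_cp)
  local tail = s:sub(i)
  return tail >= encode(first_cp) and tail < encode(last_cp + 1)
end

-- Greek small letters: alpha (U+03B1) through omega (U+03C9)
local s = "a" .. encode(0x3BB) .. "z"          -- "a", lambda (U+03BB), "z"
assert(not in_range_at(s, 1, 0x3B1, 0x3C9))    -- "a" sorts below the range
assert(in_range_at(s, 2, 0x3B1, 0x3C9))        -- lambda is inside
assert(not in_range_at(s, 4, 0x3B1, 0x3C9))    -- "z" is ASCII, below the range
```

The upper bound is encoded as `last_cp + 1`, which is the "first value after the range" from the ASCII example above.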
Jay