- Subject: Re: htmlentities table
- From: Rici Lake <lua@...>
- Date: Fri, 28 Oct 2005 20:36:46 -0500
On 28-Oct-05, at 6:56 PM, David Given wrote:
> On Friday 28 October 2005 22:13, Rici Lake wrote:
> [...]
>> The full pattern: [^\128-\191][\128-\191]
> [...]
>> matches:
>> "Not a continuation byte" followed by 0 or more "continuation bytes"
>
> Should there be a * on the end of that pattern? Because what you wrote
> matches 'not a continuation byte' followed by 'exactly one
> continuation byte'.
Quite right, a cut and paste error. The one in the original message was
correct.
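To make the difference concrete, here is a small sketch comparing the two patterns on the three-character string "añb". The helper name `collect` and the sample string are only for illustration, and the sketch assumes Lua 5.1 or later (it uses `string.gmatch` and the `#` operator):

```lua
-- Gather every match of `pattern` in `str` into a list.
local function collect(str, pattern)
  local pieces = {}
  for piece in string.gmatch(str, pattern) do
    pieces[#pieces + 1] = piece
  end
  return pieces
end

-- "añb" in UTF-8: ñ is the two bytes 195 177.
local sample = "a\195\177b"

-- Without the *, a character must have exactly one continuation byte,
-- so the plain ASCII characters are skipped.
local without_star = collect(sample, "[^\128-\191][\128-\191]")

-- With the *, zero or more continuation bytes are accepted,
-- so every character is returned.
local with_star = collect(sample, "[^\128-\191][\128-\191]*")

print(#without_star, #with_star)
```

Without the `*` only the ñ is matched; with it, all three characters come back in order.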
Here's some sample code, which simply turns every character in all of
the command line arguments into a U+hex code:
-- Non-validating (and potentially faster) implementation
function string.eachutf8(str)
  -- gmatch is the Lua 5.1 name for what Lua 5.0 called gfind
  return str:gmatch("[^\128-\191][\128-\191]*")
end

-- Payload bits carried by each valid lead byte
local prefix = {}
for i = 0, 127 do prefix[i] = i end          -- ASCII
for i = 194, 223 do prefix[i] = i - 192 end  -- lead of a 2-byte sequence
for i = 224, 239 do prefix[i] = i - 224 end  -- lead of a 3-byte sequence
for i = 240, 244 do prefix[i] = i - 240 end  -- lead of a 4-byte sequence

function string.toucode(seq)
  local accum = prefix[seq:byte(1)]
  for i = 2, #seq do
    -- each continuation byte contributes its low 6 bits
    accum = accum * 64 + seq:byte(i) % 64
  end
  return accum
end

for utf8seq in table.concat(arg, " "):eachutf8() do
  io.write(("U+%X "):format(utf8seq:toucode()))
end
io.write("\n")
Note that only three lines of this code are actually the library
function :) The following test uses `cat` in a peculiar way so that I
can type the test line without going through ncurses, which has no
Unicode support on the OS I use (although the terminal does):
rlake@freeb:~/xml/lualib$ lua51 quickutf8.lua `cat`
Mañana. ЖЄЫЩ ฌญมかぢに∀x.x∌ℜ
U+4D U+61 U+F1 U+61 U+6E U+61 U+2E U+20 U+416 U+404 U+42B U+429 U+20
U+E0C U+E0D U+E21 U+304B U+3062 U+306B U+2200 U+78 U+2E U+78 U+220C
U+211C
I don't speak any of the languages except the first one, so I hope I
haven't committed any major faux pas by twiddling at the keyboard.
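
For anyone who wants to check the arithmetic by hand, here is a self-contained sketch of the same decoding step, worked through for two characters from the test line. The names `lead_payload` and `decode` are just for illustration; the logic follows the prefix table and accumulation loop in the code above, and like that code it does not validate continuation bytes:

```lua
-- Payload bits of a lead byte, using the same ranges as the prefix table.
local function lead_payload(b)
  if b < 128 then return b end                      -- ASCII
  if b >= 194 and b <= 223 then return b - 192 end  -- 2-byte sequence
  if b >= 224 and b <= 239 then return b - 224 end  -- 3-byte sequence
  if b >= 240 and b <= 244 then return b - 240 end  -- 4-byte sequence
  error("not a valid UTF-8 lead byte: " .. b)
end

-- Decode one UTF-8 sequence into its code point (non-validating).
local function decode(seq)
  local accum = lead_payload(seq:byte(1))
  for i = 2, #seq do
    -- shift by 6 bits, then add the low 6 bits of the continuation byte
    accum = accum * 64 + seq:byte(i) % 64
  end
  return accum
end

-- ñ is bytes 195 177:
--   (195 - 192) * 64 + 177 % 64 = 3 * 64 + 49 = 241 = 0xF1
print(("U+%X"):format(decode("\195\177")))

-- ∀ is bytes 226 136 128:
--   ((226 - 224) * 64 + 136 % 64) * 64 + 128 % 64 = 136 * 64 + 0 = 8704 = 0x2200
print(("U+%X"):format(decode("\226\136\128")))
```

Both results agree with the U+F1 and U+2200 entries in the output above.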