- Subject: Re: newbie - Lua and unicode
- From: William Ahern <wahern@...>
- Date: Thu, 14 Sep 2006 13:56:52 -0700
On Thu, 2006-09-14 at 13:30 -0700, William Ahern wrote:
> Here's where the big gotcha comes with Unicode. A code point does not
> equal a "character". In unicode you can compose "characters" (aka
> graphemes), using multiple codepoint entities. An a+umlaut, even though
> it's a latin1 character in the older ISO standards, can be represented
> by one or three 16-bit codepoint values.
>
Actually, there are three ways to represent this on screen, and whether
they count as equivalent depends on the application and usage. If I were
scanning logs visually and grepping for a+umlaut, I'd probably want my
search key to match all of these:
1) U+00E4
2) U+0061 U+0308
3) U+0061 U+034F U+0308
These examples are valid in both UCS-2 and UTF-16.
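
A rough sketch of how a search key could be made to match all three forms,
using Python's standard unicodedata module. The helper name search_key and
the decision to drop U+034F (COMBINING GRAPHEME JOINER) before normalizing
are assumptions for illustration, not anything the standard mandates; an
application that treats the joiner as significant would not want this.

```python
import unicodedata

CGJ = "\u034f"  # COMBINING GRAPHEME JOINER, invisible on screen

def search_key(text: str) -> str:
    """Hypothetical helper: fold visually identical spellings to one key.

    Assumption: for loose searching, CGJ can be dropped. NFC then composes
    a base letter plus combining mark into the precomposed code point.
    """
    return unicodedata.normalize("NFC", text.replace(CGJ, ""))

# The three spellings of a+umlaut listed above.
forms = [
    "\u00e4",          # 1) precomposed
    "a\u0308",         # 2) a + combining diaeresis
    "a\u034f\u0308",   # 3) a + CGJ + combining diaeresis
]

keys = [search_key(f) for f in forms]
all_match = len(set(keys)) == 1
```

Applying search_key to both the needle and the haystack before comparing
is the point; normalizing only one side would still miss matches.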
--
William Ahern <wahern@barracudanetworks.com>