- Subject: Re: htmlentities table
- From: Rici Lake <lua@...>
- Date: Fri, 28 Oct 2005 12:33:51 -0500
On 28-Oct-05, at 11:15 AM, PA wrote:
> Assuming a long list of character substitutions (e.g. "Ä" -> "A", etc):
> http://dev.alt.textdrive.com/file/lu/LUStringBasicLatin.txt
> What would be a reasonable implementation to actually perform the
> substitutions?
I thought I'd posted this recently.
A pattern which works is "[^\128-\191][\128-\191]*"
This won't catch invalid utf-8 sequences, but if the string is valid
utf-8, each match picks up exactly one utf-8 sequence, so you can use
the pattern to iterate over the string with gsub or gfind.
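For example, a quick sketch of the substitution (untested; the table
name 'latinize' is made up, and in practice you'd build it from the
list above):

    local latinize = { ["\195\132"] = "A" }   -- one sample entry: "Ä" -> "A"

    local function substitute(s)
      -- each match is one complete utf-8 sequence; if the table has no
      -- entry for it the function returns nil and gsub keeps the original
      return (string.gsub(s, "[^\128-\191][\128-\191]*",
                          function (c) return latinize[c] end))
    end

    print(substitute("\195\132pfel"))   --> Apfel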
If you look at the definition of utf-8 you'll see why it works: a utf-8
sequence is either a single 7-bit byte (i.e. < 128) or a first byte
(actually in the range 0xC2-0xF4, but for simplicity we can say >=
0xC0) followed by a number of successor bytes determined by that first
byte, each of which carries 6 bits and is in the range 0x80-0xBF, or
128-191. The key point is that every byte except the first is in
0x80-0xBF, and the first byte never is, which is exactly what the above
pattern says.
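For example, this just splits a string into its sequences and shows
their byte lengths:

    for c in string.gfind("na\195\175ve", "[^\128-\191][\128-\191]*") do
      print(c, string.len(c))   -- the two-byte sequence prints with length 2
    end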
You can use an even simpler one to count the number of utf-8 sequences
in a string; just count the number of non-successor bytes
([^\128-\191]).
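Concretely (the second result of gsub is the number of matches):

    local function utf8len(s)
      -- every non-successor byte starts exactly one utf-8 sequence
      local _, count = string.gsub(s, "[^\128-\191]", "")
      return count
    end

    print(utf8len("na\195\175ve"))   --> 5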
It is quite easy to actually verify the string's utf-8 validity as
well, but I'll leave that as an exercise for the reader. It is one of
the use cases I have for the modification to Mike Pall's gsub patch I
posted a few days ago, since it involves using the head character of
the sequence to select a function which matches the remainder of the
sequence against one of half a dozen or so validators.
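For anyone who wants a starting point, here is a rough, untested sketch
in plain Lua (it doesn't use the gsub patch, and the function names are
made up); it picks a tail pattern from the head byte and checks the
remainder of the sequence against it, rejecting overlong forms,
surrogates and values above U+10FFFF:

    local function tail_pattern(head)
      -- choose the validator for the bytes that must follow the head byte
      if head <= 127 then return "" end                           -- ASCII
      if head >= 194 and head <= 223 then return "^[\128-\191]" end
      if head == 224 then return "^[\160-\191][\128-\191]" end
      if head >= 225 and head <= 236 then return "^[\128-\191][\128-\191]" end
      if head == 237 then return "^[\128-\159][\128-\191]" end    -- no surrogates
      if head >= 238 and head <= 239 then return "^[\128-\191][\128-\191]" end
      if head == 240 then return "^[\144-\191][\128-\191][\128-\191]" end
      if head >= 241 and head <= 243 then return "^[\128-\191][\128-\191][\128-\191]" end
      if head == 244 then return "^[\128-\143][\128-\191][\128-\191]" end
      return nil                                                  -- invalid head byte
    end

    local function utf8valid(s)
      local i = 1
      while i <= string.len(s) do
        local p = tail_pattern(string.byte(s, i))
        if not p then return false, i end
        if p == "" then
          i = i + 1
        else
          -- the anchored find must match immediately after the head byte
          local _, e = string.find(s, p, i + 1)
          if not e then return false, i end
          i = e + 1
        end
      end
      return true
    end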