- Subject: Re: htmlentities table
- From: Rici Lake <lua@...>
- Date: Fri, 28 Oct 2005 12:33:51 -0500
On 28-Oct-05, at 11:15 AM, PA wrote:
> Assuming a long list of character substitutions (e.g. "Ä" -> "A", etc):
> http://dev.alt.textdrive.com/file/lu/LUStringBasicLatin.txt
> What would be a reasonable implementation to actually perform the
> substitutions?
I thought I'd posted this recently.
A pattern which works is "[^\128-\191][\128-\191]*"
This won't catch invalid utf-8 sequences, but if the string is valid
utf-8, each match picks up exactly one utf-8 sequence, so you can use
the pattern to iterate over the string with gsub or gfind.
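For example, a quick sketch of the substitution (untested; the table
name 'latinize' is made up, and in practice you'd build it from the
list above):

    local latinize = { ["\195\132"] = "A" }   -- one sample entry: "Ä" -> "A"

    local function substitute(s)
      -- each match is one complete utf-8 sequence; if the table has no
      -- entry for it the function returns nil and gsub keeps the original
      return (string.gsub(s, "[^\128-\191][\128-\191]*",
                          function (c) return latinize[c] end))
    end

    print(substitute("\195\132pfel"))   --> Apfel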
If you look at the definition of utf-8 you'll see why it works: a utf-8
sequence is either a single 7-bit byte (i.e. < 128) or a first byte
(actually in the range 0xC2-0xF4, but for simplicity we can say >=
0xC0) followed by a number of successor bytes determined by that first
byte, each of which carries 6 bits and is in the range 0x80-0xBF, or
128-191. The key point is that every byte except the first is in
0x80-0xBF, and the first byte never is, which is exactly what the above
pattern says.
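For example, this just splits a string into its sequences and shows
their byte lengths:

    for c in string.gfind("na\195\175ve", "[^\128-\191][\128-\191]*") do
      print(c, string.len(c))   -- the two-byte sequence prints with length 2
    end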
You can use an even simpler one to count the number of utf-8 sequences
in a string; just count the number of non-successor bytes
([^\128-\191]).
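Concretely (the second result of gsub is the number of matches):

    local function utf8len(s)
      -- every non-successor byte starts exactly one utf-8 sequence
      local _, count = string.gsub(s, "[^\128-\191]", "")
      return count
    end

    print(utf8len("na\195\175ve"))   --> 5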
It is quite easy to actually verify the string's utf-8 validity as
well, but I'll leave that as an exercise for the reader. It is one of
the use cases I have for the modification to Mike Pall's gsub patch I
posted a few days ago, since it involves using the head character of
the sequence to select a function which matches the remainder of the
sequence against one of half a dozen or so validators.
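For anyone who wants a starting point, here is a rough, untested sketch
in plain Lua (it doesn't use the gsub patch, and the function names are
made up); it picks a tail pattern from the head byte and checks the
remainder of the sequence against it, rejecting overlong forms,
surrogates and values above U+10FFFF:

    local function tail_pattern(head)
      -- choose the validator for the bytes that must follow the head byte
      if head <= 127 then return "" end                           -- ASCII
      if head >= 194 and head <= 223 then return "^[\128-\191]" end
      if head == 224 then return "^[\160-\191][\128-\191]" end
      if head >= 225 and head <= 236 then return "^[\128-\191][\128-\191]" end
      if head == 237 then return "^[\128-\159][\128-\191]" end    -- no surrogates
      if head >= 238 and head <= 239 then return "^[\128-\191][\128-\191]" end
      if head == 240 then return "^[\144-\191][\128-\191][\128-\191]" end
      if head >= 241 and head <= 243 then return "^[\128-\191][\128-\191][\128-\191]" end
      if head == 244 then return "^[\128-\143][\128-\191][\128-\191]" end
      return nil                                                  -- invalid head byte
    end

    local function utf8valid(s)
      local i = 1
      while i <= string.len(s) do
        local p = tail_pattern(string.byte(s, i))
        if not p then return false, i end
        if p == "" then
          i = i + 1
        else
          -- the anchored find must match immediately after the head byte
          local _, e = string.find(s, p, i + 1)
          if not e then return false, i end
          i = e + 1
        end
      end
      return true
    end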