Re: Of Unicode in the next Lua version

lua-l archive


Subject: Re: Of Unicode in the next Lua version
From: Tim Hill <drtimhill@...>
Date: Sat, 15 Jun 2013 13:37:25 -0700

The real problem is badly formed UTF-8, and there is too much of it in the wild to just bail out with errors. Some common oddities I have encountered:

-- UTF-16 surrogate pairs encoded as UTF-8 (rather than the underlying code point)

-- UTF-16 BOM encoded in UTF-8 (which of course has no use for a BOM)

-- Non-canonical (overlong) UTF-8 encodings (for example, encoding in 5 bytes instead of 4)
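To make the three oddities concrete, here is a rough sketch of how they might be spotted at the byte level. It is written in Python purely for illustration; the function name find_oddities and its labels are hypothetical and not part of any Lua proposal. It only flags these three patterns and is not a full validator.

```python
import re

# U+FEFF (the byte order mark) as it appears once encoded in UTF-8.
BOM = b"\xef\xbb\xbf"

# A surrogate (U+D800..U+DFFF) pushed through a 3-byte UTF-8 encoder
# always starts with 0xED followed by a byte in 0xA0..0xBF.
SURROGATE = re.compile(rb"\xed[\xa0-\xbf][\x80-\xbf]")

# Non-canonical forms: 2-, 3- and 4-byte sequences whose value would fit
# in fewer bytes, plus the obsolete 5- and 6-byte lead bytes (0xF8..0xFD).
NON_CANONICAL = re.compile(
    rb"[\xc0\xc1][\x80-\xbf]"
    rb"|\xe0[\x80-\x9f][\x80-\xbf]"
    rb"|\xf0[\x80-\x8f][\x80-\xbf]{2}"
    rb"|[\xf8-\xfd]"
)


def find_oddities(data: bytes) -> list:
    """Return labels for the oddities found in data (hypothetical helper)."""
    found = []
    if data.startswith(BOM):
        found.append("bom")
    if SURROGATE.search(data):
        found.append("surrogate")
    if NON_CANONICAL.search(data):
        found.append("non-canonical")
    return found


if __name__ == "__main__":
    # 0xC0 0xAF is the classic 2-byte overlong encoding of "/".
    print(find_oddities(b"\xc0\xaf"))
    print(find_oddities(b"\xef\xbb\xbfhello"))
```

Scanning raw bytes this way is safe for these patterns because none of the lead bytes involved can occur as a continuation byte in well-formed UTF-8.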

To be honest, I'm not sure how I would approach an "IsValidUTF8()" function. I always tend to fall back on the original TCP/IP philosophy: be rigorous in what you generate, and forgiving in what you accept.
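One way to read "rigorous in what you generate, forgiving in what you accept" is sketched below, again in Python and only as an illustration. The names generate and accept are hypothetical. The forgiving side drops a leading BOM and substitutes U+FFFD for bytes it cannot decode; the rigorous side refuses to emit anything that is not well-formed.

```python
def generate(text: str) -> bytes:
    """Rigorous output: strict encoding, no BOM.

    Raises UnicodeEncodeError if text contains a lone surrogate.
    """
    return text.encode("utf-8")


def accept(data: bytes) -> str:
    """Forgiving input: never raises on malformed bytes.

    The "utf-8-sig" codec strips a leading BOM if present, and
    errors="replace" turns undecodable bytes into U+FFFD.
    """
    return data.decode("utf-8-sig", errors="replace")


if __name__ == "__main__":
    print(accept(b"\xef\xbb\xbfhi"))      # BOM is dropped
    print(repr(accept(b"a\xffb")))        # 0xFF becomes U+FFFD
    try:
        generate("\ud800")                # lone surrogate is refused
    except UnicodeEncodeError as exc:
        print("refused:", exc.reason)
```

The trade-off is that the forgiving path loses information: once a bad byte has become U+FFFD, the original input cannot be recovered, which is part of why a separate validity check remains useful.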

--Tim

On Jun 15, 2013, at 1:08 PM, Jay Carlson <nop@nop.com> wrote:

I don't understand where "false" instead of an error would be useful. Once you've decided to iterate over a string as UTF-8, it is a surprise when the string turns out not to be UTF-8, and it's unlikely your code will do anything useful. There could be a separate utf8.isvalid(s, [byteoffset [, bytelen]]) for when you're testing.
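The split Jay describes, an iterator that raises plus a separate validity test, could look roughly like this. This is a Python analogue for illustration only: is_valid mirrors the suggested utf8.isvalid(s, [byteoffset [, bytelen]]) signature but uses 0-based offsets, and both names are assumptions, not an existing API.

```python
from typing import Iterator, Optional


def is_valid(s: bytes, byteoffset: int = 0, bytelen: Optional[int] = None) -> bool:
    """Return True if the selected byte range is well-formed UTF-8.

    byteoffset is 0-based here; bytelen defaults to "up to the end".
    """
    end = len(s) if bytelen is None else byteoffset + bytelen
    try:
        # Python's strict decoder rejects overlong forms and surrogates.
        s[byteoffset:end].decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def codepoints(s: bytes) -> Iterator[int]:
    """Iterate over code points; raises UnicodeDecodeError on bad input."""
    for ch in s.decode("utf-8"):
        yield ord(ch)


if __name__ == "__main__":
    data = b"h\xc3\xa9"                   # "h" followed by U+00E9
    print(is_valid(data))                 # whole string
    print(is_valid(data, 0, 2))           # range cuts U+00E9 in half
    print(list(codepoints(data)))
```

Callers that expect clean input just iterate and let the error surface; callers that are testing input call is_valid first and never see an exception.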

Follow-Ups:
- Re: Of Unicode in the next Lua version, Paul K
- Re: Of Unicode in the next Lua version, Jay Carlson
- Re: Of Unicode in the next Lua version, David Heiko Kolf

References:
- Of Unicode in the next Lua version, Pierre-Yves Gérardy
- Re: Of Unicode in the next Lua version, Roberto Ierusalimschy
- Re: Of Unicode in the next Lua version, Pierre-Yves Gérardy
- Re: Of Unicode in the next Lua version, Jay Carlson
