The real problem is badly-formed UTF-8 .. and there is too much of it to just bail with errors. Some common oddities I have encountered:

-- UTF-16 surrogate pairs encoded as UTF-8 (rather than the underlying code point)
-- UTF-16 BOM encoded in UTF-8 (which of course has no use for a BOM)
-- Non-canonical UTF-8 encodings (for example, encoding in 5 bytes instead of 4)

To be honest, I'm not sure how I would approach an "IsValidUTF8()" function .. I always tend to fall back on the original TCP/IP philosophy: be rigorous in what you generate, and forgiving in what you accept.

--Tim

On Jun 15, 2013, at 1:08 PM, Jay Carlson <nop@nop.com> wrote:

> I don't understand where "false" instead of an error would be useful. Once you've decided to iterate over a string as UTF-8, it is a surprise when the string turns out not to be UTF-8, and it's unlikely your code will do anything useful.
>
> There could be a separate utf8.isvalid(s, [byteoffset [, bytelen]]) for when you're testing.
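
For illustration only, here is one way the oddities listed above could be detected and a validity check built on top of them. This is a rough Python sketch, not a proposal for the actual library: the names `scan_utf8` and `is_valid_utf8` are hypothetical, the offset and length parameters loosely mirror the suggested `utf8.isvalid(s, [byteoffset [, bytelen]])` but are 0-based here, and treating a leading BOM as "odd but acceptable" is an assumption made for this example.

```python
from typing import List, Optional, Tuple

# Smallest code point that legitimately needs an n-byte sequence.
# The 5- and 6-byte forms come from the original UTF-8 design and are
# never valid for Unicode today; they are decoded only to report them.
MIN_CODE_POINT = {2: 0x80, 3: 0x800, 4: 0x10000, 5: 0x200000, 6: 0x4000000}


def scan_utf8(data: bytes, offset: int = 0,
              length: Optional[int] = None) -> List[Tuple[int, str]]:
    """Return (byte_offset, problem) pairs; an empty list means nothing odd was found."""
    end = len(data) if length is None else min(len(data), offset + length)
    problems = []
    i = offset
    while i < end:
        lead = data[i]
        if lead < 0x80:                      # plain ASCII
            i += 1
            continue
        if 0xC0 <= lead <= 0xDF:
            n, cp = 2, lead & 0x1F
        elif 0xE0 <= lead <= 0xEF:
            n, cp = 3, lead & 0x0F
        elif 0xF0 <= lead <= 0xF7:
            n, cp = 4, lead & 0x07
        elif 0xF8 <= lead <= 0xFB:
            n, cp = 5, lead & 0x03
        elif 0xFC <= lead <= 0xFD:
            n, cp = 6, lead & 0x01
        else:                                # stray continuation byte, 0xFE or 0xFF
            problems.append((i, "invalid byte"))
            i += 1
            continue
        tail = data[i + 1:min(i + n, end)]
        if len(tail) != n - 1 or any((b & 0xC0) != 0x80 for b in tail):
            problems.append((i, "bad continuation"))
            i += 1                           # resynchronise on the next byte
            continue
        for b in tail:
            cp = (cp << 6) | (b & 0x3F)
        if cp < MIN_CODE_POINT[n]:
            problems.append((i, "overlong"))       # non-canonical encoding
        elif 0xD800 <= cp <= 0xDFFF:
            problems.append((i, "surrogate"))      # UTF-16 surrogate encoded as UTF-8
        elif cp > 0x10FFFF:
            problems.append((i, "out of range"))
        elif cp == 0xFEFF and i == offset:
            problems.append((i, "bom"))            # leading BOM: legal, but has no use
        i += n
    return problems


def is_valid_utf8(data: bytes, offset: int = 0,
                  length: Optional[int] = None) -> bool:
    """Strict check; a leading BOM is tolerated (an assumption of this sketch)."""
    return all(kind == "bom" for _, kind in scan_utf8(data, offset, length))
```

Because the scanner reports what it found instead of raising, the caller can decide case by case whether to reject the input, repair it (for example by recombining a surrogate pair into one code point), or pass it through, which is the "forgiving in what you accept" side. The strict yes/no answer stays available separately through `is_valid_utf8`.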