- Subject: Re: Of Unicode in the next Lua version
- From: David Heiko Kolf <david@...>
- Date: Sun, 16 Jun 2013 13:30:32 +0200
Tim Hill wrote:
> The real problem is badly-formed UTF-8 .. and there is too much of it to
> just bail with errors. Some common oddities I have encountered:
>
> -- UTF-16 surrogate pairs encoded as UTF-8 (rather than the underlying
> code point)
> -- UTF-16 BOM encoded in UTF-8 (which of course has no use for a BOM)
> -- Non-canonical UTF-8 encodings (for example, encoding in 5 bytes
> instead of 4)
>
> To be honest, I'm not sure how I would approach an "IsValidUTF8()"
> function .. I always tend to fall back on the original TCP/IP
> philosophy: be rigorous in what you generate, and forgiving in what you
> accept.
The BOM in UTF-8 is mainly annoying for plain-ASCII applications where
UTF-8 should be transparent in strings. But as far as I remember it is
not invalid UTF-8 (though its only use is to signal that the text is
indeed UTF-8). A Unicode-aware application can simply ignore it.
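Ignoring it is a one-line check: the UTF-8 encoding of the BOM (U+FEFF) is the three-byte sequence EF BB BF at the very start of the text. A minimal sketch, assuming the helper name strip_utf8_bom (my own name, not an established API):

```lua
-- Drop a leading UTF-8 encoded BOM (bytes EF BB BF), if present.
-- An application that treats UTF-8 transparently can use the result as-is.
local function strip_utf8_bom(s)
  if s:sub(1, 3) == "\239\187\191" then
    return s:sub(4)
  end
  return s
end
```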
The last point, the non-canonical (overlong) UTF-8 encodings, is actually
a huge security risk that has already opened holes in the field.
UTF-8 is quite often used as just an extension of ASCII (which it was
meant to be), and so some filters checked only that URLs did not contain
"../" to access upper directories. They did not check all the
non-canonical ways of encoding dots and slashes, so such paths passed
the filter. At some point (I guess just before the OS API) the UTF-8 was
converted "forgivingly" to UTF-16, and suddenly the dangerous paths took
effect.
That is the reason why the standards say that the conversion of UTF-8 to
code points must not tolerate non-canonical encodings, but must either
reject the string completely or substitute a code point that signals an
encoding error (the replacement character, U+FFFD).
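A strict checker in the spirit of the "IsValidUTF8()" Tim asked about might look like this. A minimal sketch; the function name and the reject-the-whole-string policy are my assumptions:

```lua
-- Strict UTF-8 validity check: rejects overlong (non-canonical)
-- encodings, UTF-16 surrogates (U+D800..U+DFFF) and code points
-- above U+10FFFF.
local function is_valid_utf8(s)
  local i, n = 1, #s
  while i <= n do
    local b = s:byte(i)
    local len, cp, min
    if b < 0x80 then
      len, cp, min = 1, b, 0
    elseif b >= 0xC2 and b <= 0xDF then      -- C0 and C1 are always overlong
      len, cp, min = 2, b - 0xC0, 0x80
    elseif b >= 0xE0 and b <= 0xEF then
      len, cp, min = 3, b - 0xE0, 0x800
    elseif b >= 0xF0 and b <= 0xF4 then
      len, cp, min = 4, b - 0xF0, 0x10000
    else
      return false     -- 0x80..0xC1 and 0xF5..0xFF never start a sequence
    end
    for j = i + 1, i + len - 1 do
      local c = s:byte(j)
      if not c or c < 0x80 or c > 0xBF then return false end
      cp = cp * 64 + (c - 0x80)
    end
    if cp < min then return false end                       -- overlong encoding
    if cp >= 0xD800 and cp <= 0xDFFF then return false end  -- surrogate
    if cp > 0x10FFFF then return false end
    i = i + len
  end
  return true
end
```

With this, both the overlong "../" bytes and a UTF-8 encoded surrogate are refused outright instead of being decoded forgivingly.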
Best regards,
David Kolf