
On 9/4/2015 2:38 PM, Coda Highland wrote:
> Besides, the standard maxim in these cases is "be liberal in what you
> accept; be conservative in what you send." Why should you throw an
> error when reading data that diverges from the standard if the result
> is still meaningful? Sure, don't GENERATE these UTF-8 codes, but don't
> barf on them either.

While I endorse the maxim most of the time, the restrictions in the definition of UTF-8 are there for a specific reason: to require that each valid Unicode code point have exactly one valid UTF-8 representation. That is part of a defense-in-depth approach to preventing abuses that could occur if it were legal to write U+0000 as anything other than the single byte 0x00, or to disguise other semantically interesting characters with byte sequences other than their usual representation.
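The classic demonstration is the overlong two-byte form 0xC0 0xAF for U+002F ('/'): a decoder that follows only the bit pattern, without the shortest-form check, maps it to the same code point as the ordinary one-byte form. A minimal sketch (illustration only, not the behavior of any particular library):

    -- Deliberately naive decode of a two-byte sequence: follow the
    -- 110xxxxx 10yyyyyy bit pattern and skip the shortest-form check.
    local function naive_decode2(b1, b2)
      return (b1 % 0x20) * 0x40 + (b2 % 0x40)
    end

    print(naive_decode2(0xC0, 0xAF))  --> 47, i.e. U+002F '/'
    print(("/"):byte())               --> 47, the same code point

A filter that scans the raw bytes for 0x2F never sees the overlong form, so a lenient decoder downstream quietly reintroduces the very '/' the filter was meant to block.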

That said, the bit mapping used by UTF-8 does naturally extend to represent all 31-bit values, including halves of surrogate pairs (or complete pairs) and values beyond the defined range of Unicode code points. Given that Lua has historically treated strings as (mostly) opaque blobs, it does seem reasonable for the "utf8" library to take the same permissive view.
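For concreteness, the original pre-RFC 3629 bit pattern extends to six bytes and covers every 31-bit value; surrogate halves and values past U+10FFFF encode just as mechanically as ordinary code points. A sketch of that generalized encoder (my own illustration, not a description of the utf8 library's internals):

    -- Encode any value 0..2^31-1 with the extended UTF-8 bit pattern.
    local function encode_extended(c)
      assert(c >= 0 and c <= 0x7FFFFFFF, "value out of 31-bit range")
      if c < 0x80 then return string.char(c) end  -- plain ASCII byte
      local bytes = {}
      local mfb = 0x3F      -- maximum payload that fits in the lead byte
      repeat
        table.insert(bytes, 1, 0x80 + c % 0x40)  -- 10xxxxxx continuation
        c = math.floor(c / 0x40)
        mfb = math.floor(mfb / 2)  -- one fewer payload bit in the lead byte
      until c <= mfb
      -- lead byte: a run of 1-bits marking the length, then the top bits
      table.insert(bytes, 1, 0x100 - (mfb + 1) * 2 + c)
      return string.char(table.unpack(bytes))
    end

    local function hex(s)
      return (s:gsub(".", function(b) return string.format("%02X ", b:byte()) end))
    end
    print(hex(encode_extended(0xD800)))    --> ED A0 80     (a lone surrogate half)
    print(hex(encode_extended(0x110000)))  --> F4 90 80 80  (past U+10FFFF)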

Both goals could be achieved with a library routine that validates that a given utf8 string is also valid UTF-8, perhaps returning flags for the kinds of violations it found rather than just nil or false on failure. It could even optionally repair the string, merging surrogate pairs and rewriting overlong sequences to their shortest form. But such repair is exactly the case where you must take care not to create the very kind of attack opportunity that the stricter rules defend against.
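One possible shape for such a routine, with names that are purely illustrative:

    -- Validate a string against strict UTF-8, reporting each class of
    -- violation separately instead of a bare pass/fail.
    local function utf8_violations(s)
      local flags = { malformed = false, overlong = false,
                      surrogate = false, beyond_unicode = false }
      local i, n = 1, #s
      while i <= n do
        local b = s:byte(i)
        local len, cp, min
        if b < 0x80 then len, cp, min = 1, b, 0
        elseif b >= 0xC0 and b < 0xE0 then len, cp, min = 2, b % 0x20, 0x80
        elseif b >= 0xE0 and b < 0xF0 then len, cp, min = 3, b % 0x10, 0x800
        elseif b >= 0xF0 and b < 0xF8 then len, cp, min = 4, b % 0x08, 0x10000
        end
        if not len or i + len - 1 > n then
          flags.malformed = true  -- stray continuation byte or truncated tail
          i = i + 1               -- resynchronize on the next byte
        else
          local ok = true
          for j = i + 1, i + len - 1 do
            local cb = s:byte(j)
            if cb < 0x80 or cb >= 0xC0 then ok = false end
            cp = cp * 0x40 + cb % 0x40
          end
          if not ok then
            flags.malformed = true
            i = i + 1
          else
            if cp < min then flags.overlong = true end
            if cp >= 0xD800 and cp <= 0xDFFF then flags.surrogate = true end
            if cp > 0x10FFFF then flags.beyond_unicode = true end
            i = i + len
          end
        end
      end
      local clean = not (flags.malformed or flags.overlong
                         or flags.surrogate or flags.beyond_unicode)
      return clean, flags
    end

    local ok, f = utf8_violations("\xC0\xAF")  -- overlong '/'
    print(ok, f.overlong)                      --> false   true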

--
Ross Berteig                               Ross@CheshireEng.com
Cheshire Engineering Corp.           http://www.CheshireEng.com/