On Tue, Sep 8, 2015 at 11:39 AM, Ross Berteig <Ross@cheshireeng.com> wrote:
> Both goals could be achieved with a library routine that validates that a
> given utf8 string is also valid UTF-8, perhaps returning flags for the kinds
> of violations it found rather than just nil or false on failure. It could
> even optionally repair the string by merging surrogate pairs or rewriting
> longer sequences to the shortest possible sequence. But such repair is
> exactly the case where you must be concerned that you are not creating the
> very kind of attack opportunity that was defended against by the stricter
> rules.
This is, in fact, what I had suggested -- a function for validation,
and a function for normalization.
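To make that concrete, here is a rough sketch of the validation half
in Lua 5.3 (the name utf8_check and the flag names are hypothetical,
not an existing API): it scans a byte string and returns true for
strict UTF-8, or false plus a table flagging the kinds of violations
found, in the spirit of what Ross described.

    local function utf8_check(s)
      local flags = {}
      local i, n = 1, #s
      while i <= n do
        local b = s:byte(i)
        if b < 0x80 then
          i = i + 1                      -- plain ASCII
        elseif b < 0xC0 or b > 0xF7 then
          flags.bad_byte = true          -- stray continuation or 0xF8..0xFF
          i = i + 1
        else
          local len, cp, min
          if b < 0xE0 then len, cp, min = 2, b & 0x1F, 0x80
          elseif b < 0xF0 then len, cp, min = 3, b & 0x0F, 0x800
          else len, cp, min = 4, b & 0x07, 0x10000 end
          if i + len - 1 > n then
            flags.truncated = true       -- sequence runs off the end
            break
          end
          local ok = true
          for j = i + 1, i + len - 1 do
            local c = s:byte(j)
            if c < 0x80 or c > 0xBF then ok = false; break end
            cp = (cp << 6) | (c & 0x3F)
          end
          if not ok then
            flags.bad_continuation = true
            i = i + 1                    -- resync on the next byte
          else
            if cp < min then flags.overlong = true end
            if cp >= 0xD800 and cp <= 0xDFFF then flags.surrogate = true end
            if cp > 0x10FFFF then flags.out_of_range = true end
            i = i + len
          end
        end
      end
      if next(flags) then return false, flags end
      return true
    end

So utf8_check("\xC0\x80") would return false plus { overlong = true },
and a caller can decide which violations are fatal and which are
repairable.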
Of note, normalization can in fact be done in a way that is immune to
malfeasance. What you do with the string AFTER normalization may, of
course, be a risk. But running a syntactic normalization pass before a
subsequent semantic-level validation (that is, validating not just the
UTF-8 encoding but the contents of the string) makes it easier to
defend against such attacks, because post-normalization you can be
sure that problematic characters (e.g. control characters or embedded
NUL bytes) have exactly one canonical representation.
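A sketch of the normalization half, equally hypothetical, and assuming
the input already passed the well-formedness parts of the check above:
decode leniently, merging CESU-8 style surrogate pairs and accepting
overlong sequences, then re-encode every code point in shortest form.

    local function encode(cp)            -- shortest-form UTF-8 encoder
      if cp < 0x80 then
        return string.char(cp)
      elseif cp < 0x800 then
        return string.char(0xC0 | (cp >> 6), 0x80 | (cp & 0x3F))
      elseif cp < 0x10000 then
        return string.char(0xE0 | (cp >> 12),
                           0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F))
      else
        return string.char(0xF0 | (cp >> 18), 0x80 | ((cp >> 12) & 0x3F),
                           0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F))
      end
    end

    local function decode(s, i)          -- lenient one-sequence decode
      local b = s:byte(i)
      if b < 0x80 then return b, i + 1 end
      local len = b < 0xE0 and 2 or b < 0xF0 and 3 or 4
      local cp = b & (0xFF >> (len + 1))
      for j = i + 1, i + len - 1 do
        cp = (cp << 6) | (s:byte(j) & 0x3F)
      end
      return cp, i + len
    end

    local function utf8_normalize(s)
      local out, i, n = {}, 1, #s
      while i <= n do
        local cp
        cp, i = decode(s, i)
        if cp >= 0xD800 and cp <= 0xDBFF and i <= n then
          local lo, ni = decode(s, i)    -- try to merge a CESU-8 pair
          if lo >= 0xDC00 and lo <= 0xDFFF then
            cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00)
            i = ni
          end
        end
        out[#out + 1] = encode(cp)       -- re-emit in shortest form
      end
      return table.concat(out)
    end

After this pass utf8_normalize("\xC0\x80") comes back as a literal
"\0", so the later semantic check can simply scan for the raw bytes it
cares about.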