On Tue, Sep 8, 2015 at 11:39 AM, Ross Berteig <Ross@cheshireeng.com> wrote:
> Both goals could be achieved with a library routine that validates that a
> given utf8 string is also valid UTF-8, perhaps returning flags for the kinds
> of violations it found rather than just nil or false on failure. It could
> even optionally repair the string by merging surrogate pairs or rewriting
> longer sequences to the shortest possible sequence. But such repair is
> exactly the case where you must be concerned that you are not creating the
> very kind of attack opportunity that was defended against by the stricter
> rules.
This is, in fact, what I had suggested -- a function for validation,
and a function for normalization.
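To make that concrete, here is a rough sketch of the validation half
in Lua 5.3 (the name utf8_check and the flag names are hypothetical,
not an existing API): it scans a byte string and returns true for
strict UTF-8, or false plus a table flagging the kinds of violations
found, in the spirit of what Ross described.

    local function utf8_check(s)
      local flags = {}
      local i, n = 1, #s
      while i <= n do
        local b = s:byte(i)
        if b < 0x80 then
          i = i + 1                      -- plain ASCII
        elseif b < 0xC0 or b > 0xF7 then
          flags.bad_byte = true          -- stray continuation or 0xF8..0xFF
          i = i + 1
        else
          local len, cp, min
          if b < 0xE0 then len, cp, min = 2, b & 0x1F, 0x80
          elseif b < 0xF0 then len, cp, min = 3, b & 0x0F, 0x800
          else len, cp, min = 4, b & 0x07, 0x10000 end
          if i + len - 1 > n then
            flags.truncated = true       -- sequence runs off the end
            break
          end
          local ok = true
          for j = i + 1, i + len - 1 do
            local c = s:byte(j)
            if c < 0x80 or c > 0xBF then ok = false; break end
            cp = (cp << 6) | (c & 0x3F)
          end
          if not ok then
            flags.bad_continuation = true
            i = i + 1                    -- resync on the next byte
          else
            if cp < min then flags.overlong = true end
            if cp >= 0xD800 and cp <= 0xDFFF then flags.surrogate = true end
            if cp > 0x10FFFF then flags.out_of_range = true end
            i = i + len
          end
        end
      end
      if next(flags) then return false, flags end
      return true
    end

So utf8_check("\xC0\x80") would return false plus { overlong = true },
and a caller can decide which violations are fatal and which are
repairable.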
Of note, normalization can in fact be done in a way that is immune to
malfeasance. What you do with the string AFTER normalization may, of
course, be a risk. But running a syntactic normalization pass before a
subsequent semantic-level validation (that is, validating not just the
UTF-8 encoding but the contents of the string) makes it easier to
defend against such attacks, because post-normalization you can be
sure that problematic characters (e.g. control characters or embedded
NUL bytes) have exactly one canonical representation.
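A sketch of the normalization half, equally hypothetical, and assuming
the input already passed the well-formedness parts of the check above:
decode leniently, merging CESU-8 style surrogate pairs and accepting
overlong sequences, then re-encode every code point in shortest form.

    local function encode(cp)            -- shortest-form UTF-8 encoder
      if cp < 0x80 then
        return string.char(cp)
      elseif cp < 0x800 then
        return string.char(0xC0 | (cp >> 6), 0x80 | (cp & 0x3F))
      elseif cp < 0x10000 then
        return string.char(0xE0 | (cp >> 12),
                           0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F))
      else
        return string.char(0xF0 | (cp >> 18), 0x80 | ((cp >> 12) & 0x3F),
                           0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F))
      end
    end

    local function decode(s, i)          -- lenient one-sequence decode
      local b = s:byte(i)
      if b < 0x80 then return b, i + 1 end
      local len = b < 0xE0 and 2 or b < 0xF0 and 3 or 4
      local cp = b & (0xFF >> (len + 1))
      for j = i + 1, i + len - 1 do
        cp = (cp << 6) | (s:byte(j) & 0x3F)
      end
      return cp, i + len
    end

    local function utf8_normalize(s)
      local out, i, n = {}, 1, #s
      while i <= n do
        local cp
        cp, i = decode(s, i)
        if cp >= 0xD800 and cp <= 0xDBFF and i <= n then
          local lo, ni = decode(s, i)    -- try to merge a CESU-8 pair
          if lo >= 0xDC00 and lo <= 0xDFFF then
            cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00)
            i = ni
          end
        end
        out[#out + 1] = encode(cp)       -- re-emit in shortest form
      end
      return table.concat(out)
    end

After this pass utf8_normalize("\xC0\x80") comes back as a literal
"\0", so the later semantic check can simply scan for the raw bytes it
cares about.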