- Subject: Re: Changes in the validation of UTF-8
- From: Roberto Ierusalimschy <roberto@...>
- Date: Wed, 20 Mar 2019 10:58:58 -0300
> What I think is a backwards step is the lexer accepting "\u{110000}".
> Unicode escapes >10FFFF should really be an error IMO.
If you write "\u{110000}", you are explicitly asking for an
invalid code. If you want invalid codes, you might as well write
"\xf4\x90\x80\x80" or "\244\144\128\128". "\u{110000}" just makes
it easier.
(But "\u{110000}" is hardly as useful as utf8.char(0x110000).
I was wondering about removing this laxity in \u; but then surrogates
should be invalid too. Is that good?)
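(A minimal sketch, assuming the lax behavior of the 5.4 work version under
discussion, where both the \u escape and utf8.char accept codes beyond
10FFFF; all of these produce the same four bytes:)

    local a = "\u{110000}"           -- code beyond 10FFFF, written directly
    local b = "\xf4\x90\x80\x80"     -- the same four bytes, as hex escapes
    local c = "\244\144\128\128"     -- the same four bytes, as decimal escapes
    local d = utf8.char(0x110000)    -- the same four bytes, built at run time
    assert(a == b and b == c and c == d)
    print(("%02X %02X %02X %02X"):format(a:byte(1, 4)))   --> F4 90 80 80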
> UTF8PATT accepting deprecated 5 and 6 byte sequences is a similarly
> undesirable change.
UTF8PATT already accepted all kinds of wrong stuff, including
overlong sequences. 5- and 6-byte sequences are the least of the
problems here. The documentation is (and was) clear that you should use
it only on valid strings.
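(For instance, a small illustration: the pattern matches anything shaped
like a UTF-8 sequence and does not validate it, so an overlong encoding of
NUL or a sequence beyond 10FFFF matches just as happily as a valid
character:)

    local overlong = "\xE0\x80\x80"        -- overlong 3-byte encoding of U+0000
    local toobig   = "\xF4\x90\x80\x80"    -- encoding of 0x110000, beyond 10FFFF
    print(#overlong:match(utf8.charpattern))   --> 3 (the whole invalid sequence)
    print(#toobig:match(utf8.charpattern))     --> 4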
> Accepting unpaired surrogates isn't odd, and is unfortunately required
> when working with many badly designed APIs (e.g. windows file paths,
> javascript). utf-8 with unpaired surrogates allowed is often called
> "wtf-8". https://simonsapin.github.io/wtf-8/
That's the whole point: It is useful to be able to work with invalid
codes. Why is 0x110000 "more invalid" than a surrogate? If you are going
to accept surrogates, why not go the whole way and accept what UTF-8
was originally designed for?
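(A sketch of that round trip, assuming the "lax" argument that the 5.4 work
version adds to utf8.codepoint: an unpaired surrogate and 0x110000 are
treated exactly the same way, rejected by strict decoding and accepted by
lax decoding:)

    local surrogate = utf8.char(0xD800)      -- ED A0 80, a WTF-8 style lone surrogate
    local beyond    = utf8.char(0x110000)    -- F4 90 80 80, beyond 10FFFF
    for _, s in ipairs{surrogate, beyond} do
      print(pcall(utf8.codepoint, s))        -- false plus "invalid UTF-8 code" (strict)
      print(utf8.codepoint(s, 1, 1, true))   -- the code point, via lax decoding
    end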
-- Roberto