- Subject: Re: Changes in the validation of UTF-8
- From: Roberto Ierusalimschy <roberto@...>
- Date: Wed, 20 Mar 2019 10:58:58 -0300
> What I think is a backwards step is the lexer accepting "\u{110000}".
> Unicode escapes >10FFFF should really be an error IMO.
If you write "\u{110000}", you are explicitly asking for an
invalid code. If you want invalid codes, you might as well write
"\xf4\x90\x80\x80" or "\244\144\128\128". "\u{110000}" just makes
it easier.
(But "\u{110000}" is hardly as useful as utf8.char(0x110000).
I was wondering about removing this laxity in \u; but then surrogates
should be invalid too. Is that good?)
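(A minimal sketch, assuming the lax behavior of the 5.4 work version under
discussion, where both the \u escape and utf8.char accept codes beyond
10FFFF; all of these produce the same four bytes:)

    local a = "\u{110000}"           -- code beyond 10FFFF, written directly
    local b = "\xf4\x90\x80\x80"     -- the same four bytes, as hex escapes
    local c = "\244\144\128\128"     -- the same four bytes, as decimal escapes
    local d = utf8.char(0x110000)    -- the same four bytes, built at run time
    assert(a == b and b == c and c == d)
    print(("%02X %02X %02X %02X"):format(a:byte(1, 4)))   --> F4 90 80 80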
> UTF8PATT accepting deprecated 5 and 6 byte sequences is a similarly
> undesirable change.
UTF8PATT already accepted all kinds of wrong stuff, including
overlong sequences. 5- and 6-byte sequences are the least of the
problems here. The documentation is (and was) clear that you should use
it only on valid strings.
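(For instance, a small illustration: the pattern matches anything shaped
like a UTF-8 sequence and does not validate it, so an overlong encoding of
NUL or a sequence beyond 10FFFF matches just as happily as a valid
character:)

    local overlong = "\xE0\x80\x80"        -- overlong 3-byte encoding of U+0000
    local toobig   = "\xF4\x90\x80\x80"    -- encoding of 0x110000, beyond 10FFFF
    print(#overlong:match(utf8.charpattern))   --> 3 (the whole invalid sequence)
    print(#toobig:match(utf8.charpattern))     --> 4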
> Accepting unpaired surrogates isn't odd, and is unfortunately required
> when working with many badly designed APIs (e.g. windows file paths,
> javascript). utf-8 with unpaired surrogates allowed is often called
> "wtf-8". https://simonsapin.github.io/wtf-8/
That's the whole point: It is useful to be able to work with invalid
codes. Why is 0x110000 "more invalid" than a surrogate? If you are going
to accept surrogates, why not go the whole way and accept what UTF-8
was originally designed for?
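(A sketch of that round trip, assuming the "lax" argument that the 5.4 work
version adds to utf8.codepoint: an unpaired surrogate and 0x110000 are
treated exactly the same way, rejected by strict decoding and accepted by
lax decoding:)

    local surrogate = utf8.char(0xD800)      -- ED A0 80, a WTF-8 style lone surrogate
    local beyond    = utf8.char(0x110000)    -- F4 90 80 80, beyond 10FFFF
    for _, s in ipairs{surrogate, beyond} do
      print(pcall(utf8.codepoint, s))        -- false plus "invalid UTF-8 code" (strict)
      print(utf8.codepoint(s, 1, 1, true))   -- the code point, via lax decoding
    end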
-- Roberto