- Subject: Lua utf8.len violates RFC 3629? (was Re: [PATCH] Quoted String "%q" non-ascii escaping (w/ hex).)
- From: Jay Carlson <nop@...>
- Date: Fri, 30 Jun 2017 13:44:27 -0400
Let me back up a second. I believe
o Lua should be able to work with UTF-8 in some useful way;
o The support is for UTF-8, not Unicode;
o UTF-8 is an encoding for data exchange between systems, and is currently defined by RFC 3629;
o The core can't provide any support like string.lower outside of US-ASCII;
o Single-byte character sets like ISO-8859-1 may happen to work with your C locale functions, but that’s not a promise.
Please let me know if these beliefs are wrong.
On Jun 29, 2017, at 7:45 PM, Duane Leslie <parakleta@darkreality.org> wrote:
> Also, I have noticed that the `utf8_decode` function passes the UTF-16 surrogates which are illegal codepoints, so this might also need to be fixed.
Ahh, now I remember why I kept my own UTF-8 validator. Lua’s behavior seems out of conformance with RFC 3629, and this isn’t just a SHOULD in the RFC, it’s a MUST.
Quoting https://tools.ietf.org/html/rfc3629#section-3 :
> The definition of UTF-8 prohibits encoding character numbers between U+D800 and U+DFFF [...]
>
> Implementations of the decoding algorithm above MUST protect against decoding invalid sequences. For instance, a naive implementation may decode the overlong UTF-8 sequence C0 80 into the character U+0000, or the surrogate pair ED A1 8C ED BE B4 into U+233B4.
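For reference, the arithmetic behind the RFC's second example can be sketched in Lua 5.3 or later. The names decode3 and combine are made up for this illustration and are not part of any library. A decoder that skips validation turns each three-byte group into a UTF-16 surrogate, and combining the two surrogates yields U+233B4.

```lua
-- Decode a 3-byte UTF-8 sequence starting at byte i with no validity
-- checks at all, the way a naive decoder would.
function decode3(s, i)
  local b1, b2, b3 = s:byte(i, i + 2)
  return ((b1 & 0x0F) << 12) | ((b2 & 0x3F) << 6) | (b3 & 0x3F)
end

-- Standard UTF-16 surrogate-pair combination.
function combine(hi, lo)
  return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)
end

local s = "\xED\xA1\x8C\xED\xBE\xB4"
local hi = decode3(s, 1)   -- high (lead) surrogate
local lo = decode3(s, 4)   -- low (trail) surrogate
print(string.format("U+%04X U+%04X -> U+%X", hi, lo, combine(hi, lo)))
```

This should print U+D84C U+DFB4 -> U+233B4. The well-formed UTF-8 encoding of that same character is F0 A3 8E B4.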
Let’s see how we’re doing.
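-- Print what utf8.len returns for t[2], followed by the expected result.
function u(t)
  if t.rfc then print("MUST from RFC 3629:") end
  print(t[1], utf8.len(t[2]))
  print("expected", table.unpack(t.expect))
  print()
end

-- Overlong encoding of U+0000.
u{"zero, encoded", "\xC0\x80",
  expect={nil, 1}, rfc=true}
-- Surrogate pair encoded as two 3-byte sequences.
u{"bad CESU pair", "\xED\xA1\x8C\xED\xBE\xB4",
  expect={nil, 1}, rfc=true}
-- Lone high surrogate.
u{"half pair", "\xED\xA1\x8C",
  expect={nil, 1}}
u{"half plus A", "\xED\xA1\x8C" .. "A",
  expect={nil, 1}}
-- U+FEFF followed by U+233B4 in its proper 4-byte form.
u{"astral char", "\xEF\xBB\xBF\xF0\xA3\x8E\xB4",
  expect={1}, rfc=true}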
===
MUST from RFC 3629:
zero, encoded   nil     1
expected        nil     1

MUST from RFC 3629:
bad CESU pair   2
expected        nil     1

half pair       1
expected        nil     1

half plus A     2
expected        nil     1

MUST from RFC 3629:
astral char     2
expected        1
===
Note that the surrogate behavior is explicitly called out in RFC 3629’s “Security Considerations”, https://tools.ietf.org/html/rfc3629#section-10 .
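For comparison, here is a minimal sketch of a stricter length function that follows the byte ranges in the RFC 3629 grammar (section 4). It is only an illustration: strict_len is a made-up name, not part of Lua's utf8 library and not the validator mentioned earlier. It assumes Lua 5.3 or later for the "\x" string escapes.

```lua
-- Returns the number of characters in s if s is valid per the RFC 3629
-- grammar. Otherwise returns nil plus the byte position where the
-- offending sequence starts.
function strict_len(s)
  local i, n, len = 1, 0, #s
  while i <= len do
    local b = s:byte(i)
    local extra, lo, hi   -- continuation count; allowed range of 2nd byte
    if b <= 0x7F then extra = 0
    elseif b >= 0xC2 and b <= 0xDF then extra, lo, hi = 1, 0x80, 0xBF
    elseif b == 0xE0 then extra, lo, hi = 2, 0xA0, 0xBF  -- no overlongs
    elseif b == 0xED then extra, lo, hi = 2, 0x80, 0x9F  -- no surrogates
    elseif b >= 0xE1 and b <= 0xEF then extra, lo, hi = 2, 0x80, 0xBF
    elseif b == 0xF0 then extra, lo, hi = 3, 0x90, 0xBF  -- no overlongs
    elseif b >= 0xF1 and b <= 0xF3 then extra, lo, hi = 3, 0x80, 0xBF
    elseif b == 0xF4 then extra, lo, hi = 3, 0x80, 0x8F  -- <= U+10FFFF
    else return nil, i end  -- C0, C1, F5..FF, or a stray continuation byte
    for k = 1, extra do
      local c = s:byte(i + k)
      if not c then return nil, i end   -- truncated sequence
      local min, max = 0x80, 0xBF
      if k == 1 then min, max = lo, hi end
      if c < min or c > max then return nil, i end
    end
    i = i + extra + 1
    n = n + 1
  end
  return n
end

print(strict_len("\xC0\x80"))
print(strict_len("\xED\xA1\x8C\xED\xBE\xB4"))
print(strict_len("\xED\xA1\x8C" .. "A"))
```

With this sketch the overlong and surrogate test strings above all come back as nil, 1. It does not treat a leading U+FEFF specially, so the "astral char" string still counts as 2.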
--
Jay Carlson
nop@nop.com