- Subject: Re: Changes in the validation of UTF-8
- From: Dirk Laurie <dirk.laurie@...>
- Date: Sun, 17 Mar 2019 10:57:08 +0200
On Sun, 17 Mar 2019 at 09:03, Andrew Gierth
<andrew@tao11.riddles.org.uk> wrote:
>
> >>>>> "Dirk" == Dirk Laurie <dirk.laurie@gmail.com> writes:
>
> >> I noticed the new commit that adds support for longer (deprecated in
> >> 2003) utf8 sequences:
> >> https://github.com/lua/lua/commit/1e0c73d5b643707335b06abd2546a83d9439d14c
> >>
> >> I'm curious why this changed? It seems like a backwards step to me.
>
> Dirk> It seems to be in line with Lua's philosophy of providing
> Dirk> capability, not policy.
>
> When you're validating against an externally defined interchange
> standard, then accepting data that the actual standard explicitly
> rejects IS a lack of capability.

Lua does not come close to validating against the current UTF-8
standard. We've been through this before; Marc Balmer in particular
has been quite trenchant on this point.

All that Lua does is verify that a string satisfies the basic UTF-8
encoding: each character is either an ASCII byte or a starting byte
whose leading run of 1 bits says how many bytes in total are being
encoded, followed by the right number of 10... continuation bytes.
It's quick-and-dirty, and it doesn't get less dirty by patching over
one of the many ways in which it falls short.

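
To make the distinction concrete, here is a rough sketch of that kind of
structural check, written in Python for illustration. It is not Lua's
actual code; the names leading_ones and is_structurally_utf8, and the
max_len parameter (defaulting to the old 6-byte limit), are assumptions
made for this example only.

```python
def leading_ones(byte: int) -> int:
    """Count the run of 1 bits at the top of a byte (0 for ASCII)."""
    count = 0
    mask = 0x80
    while mask and (byte & mask):
        count += 1
        mask >>= 1
    return count


def is_structurally_utf8(data: bytes, max_len: int = 6) -> bool:
    """Check only the byte-level shape of UTF-8, nothing more.

    max_len is a hypothetical knob: 6 allows the old long sequences,
    4 matches the length limit of the current standard.
    """
    i = 0
    n = len(data)
    while i < n:
        ones = leading_ones(data[i])
        if ones == 0:
            # Plain ASCII byte: a complete character on its own.
            i += 1
            continue
        if ones == 1 or ones > max_len:
            # A stray 10... byte, or a start byte announcing too many bytes.
            return False
        if i + ones > n:
            # The sequence is cut off by the end of the string.
            return False
        for b in data[i + 1:i + ones]:
            if b & 0xC0 != 0x80:
                # Every following byte must look like 10xxxxxx.
                return False
        i += ones
    return True
```

A check like this accepts b"\xc0\x80" (an overlong encoding of NUL) and
b"\xed\xa0\x80" (an encoded surrogate), both of which a strict decoder
such as Python's own bytes.decode("utf-8") rejects. That gap is the
point: the shape of the bytes is right, but the standard says more than
that.
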
There's no substitute for loading a genuine standard-conforming UTF-8
library. LuaRocks offers four; hopefully at least one is kept up to
date.