Re: Changes in the validation of UTF-8

lua-l archive


Subject: Re: Changes in the validation of UTF-8
From: Dirk Laurie <dirk.laurie@...>
Date: Sun, 17 Mar 2019 10:57:08 +0200

On Sun, 17 Mar 2019 at 09:03, Andrew Gierth
<andrew@tao11.riddles.org.uk> wrote:
>
> >>>>> "Dirk" == Dirk Laurie <dirk.laurie@gmail.com> writes:
>
>  >> I noticed the new commit that adds support for longer (deprecated in
>  >> 2003) utf8 sequences:
>  >> https://github.com/lua/lua/commit/1e0c73d5b643707335b06abd2546a83d9439d14c
>  >>
>  >> I'm curious why this changed? It seems like a backwards step to me.
>
>  Dirk> It seems to be in line with Lua's philosophy of providing
>  Dirk> capability, not policy.
>
> When you're validating against an externally defined interchange
> standard, then accepting data that the actual standard explicitly
> rejects IS a lack of capability.

Lua in no way even comes close to validating against the current UTF-8
standard. We've been through this before. Marc Balmer in particular
has been quite trenchant on this point.

All that Lua does is to verify that a string satisfies the basic UTF-8
encoding: each character is either ASCII or a starting byte whose
leading run of 1 bits says how many bytes in total are being encoded,
followed by the right number of continuation bytes of the form
10xxxxxx. It's quick-and-dirty, and it doesn't get less dirty by
patching over one of the many ways in which it falls short.
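
To make the distinction concrete, here is a minimal sketch of that kind
of structural check, written in Python purely for illustration. It is
not Lua's C implementation and may differ from it in details; the names
leading_ones, is_structurally_utf8, is_strict_utf8 and the max_len
parameter are made up for this example.

```python
def leading_ones(byte: int) -> int:
    """Count the run of 1 bits at the top of an 8-bit value."""
    count = 0
    mask = 0x80
    while mask and byte & mask:
        count += 1
        mask >>= 1
    return count


def is_structurally_utf8(data: bytes, max_len: int = 6) -> bool:
    """Check only the byte-level shape: lead byte plus continuation bytes.

    max_len is a hypothetical knob: 6 allows the old long sequences,
    4 restricts lead bytes to the lengths used by current UTF-8.
    """
    i = 0
    while i < len(data):
        n = leading_ones(data[i])
        if n == 0:
            i += 1  # plain ASCII byte
            continue
        if n == 1 or n > max_len:
            return False  # stray continuation byte, or lead byte too long
        tail = data[i + 1 : i + n]
        # Every following byte must look like 10xxxxxx.
        if len(tail) != n - 1 or any((b & 0xC0) != 0x80 for b in tail):
            return False
        i += n
    return True


def is_strict_utf8(data: bytes) -> bool:
    """Use Python's own decoder as a stand-in for a conforming validator."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False
```

Under this sketch, an overlong encoding such as b"\xC0\x80" or an
encoded surrogate such as b"\xED\xA0\x80" passes the structural check
but is rejected by the strict decoder, which is the kind of gap a full
validator has to close.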

There's no substitute for loading a genuine standard-conforming UTF-8
library. LuaRocks offers four; hopefully at least one is kept up to
date.

Follow-Ups:
- Re: Changes in the validation of UTF-8, Andrew Gierth
- Re: Changes in the validation of UTF-8, Jay Carlson

References:
- Changes in the validation of UTF-8, Daurnimator
- Re: Changes in the validation of UTF-8, Dirk Laurie
- Re: Changes in the validation of UTF-8, Andrew Gierth
