- Subject: Re: utf8.codes ignores spurious continuation bytes
- From: Roberto Ierusalimschy <roberto@...>
- Date: Mon, 19 Sep 2022 10:06:10 -0300
> Hello Lua-Community,
>
> I have the following question:
>
> Lua 5.4.4 Copyright (C) 1994-2022 Lua.org, PUC-Rio
> > for pos, cp in utf8.codes('in\xbfvalid') do print(pos, cp) end
> 1 105
> 2 110
> 4 118
> 5 97
> 6 108
> 7 105
> 8 100
>
> Any spurious continuation bytes are silently ignored by utf8.codes
> (note that position 3 never shows up in the output above).
> https://www.lua.org/manual/5.4/manual.html#pdf-utf8.codes
> says: "It raises an error if it meets any invalid byte sequence."
>
> But in the source
> https://www.lua.org/source/5.4/lutf8lib.c.html
> it seems to me this is done on purpose; in iter_aux
>
>   if (n < len) {
>     while (iscont(s + n)) n++;  /* skip continuation bytes */
>   }
>
> Is this done on purpose? Is it supposed to act like this?
> If this is done on purpose, then I misread the manual. Sorry.
> If it's not on purpose, then iter_aux has to be changed, e.g. by
> deleting the 3 lines above and using the "next" result of utf8_decode
> to update "n" (instead of n+1) a few lines below.
Thanks for the feedback. This is a bug.
Note that the fix you propose doesn't work: the value the iterator
returns to the program would then be the position of the next character,
not the position of the character being traversed. The loop is there to
go to the next character after each iteration. However, as you mentioned,
it can skip more than intended.
-- Roberto
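
To illustrate the point about positions, here is a minimal standalone
sketch in C. It is neither the code from lutf8lib.c nor the actual fix:
iter_step, decode_one and is_cont are hypothetical names, and the decoder
is simplified (1 to 4 bytes, no checks for overlong forms or surrogates).
It keeps the skip loop, so the value handed back is still the position of
the current character, and it adds one assumed check: a continuation byte
found right after a complete sequence is treated as an error.

```c
#include <stddef.h>

/* True if the byte is a UTF-8 continuation byte (10xxxxxx). */
static int is_cont(unsigned char c) {
    return (c & 0xC0) == 0x80;
}

/*
 * Hypothetical minimal decoder (not Lua's utf8_decode). Decodes one
 * sequence of 1 to 4 bytes starting at s[i], with i < len. Returns the
 * index just past the sequence, or 0 if it is malformed or truncated.
 */
static size_t decode_one(const unsigned char *s, size_t len, size_t i,
                         unsigned long *code) {
    unsigned char c = s[i];
    size_t extra;
    unsigned long cp;
    if (c < 0x80) { extra = 0; cp = c; }
    else if ((c & 0xE0) == 0xC0) { extra = 1; cp = c & 0x1F; }
    else if ((c & 0xF0) == 0xE0) { extra = 2; cp = c & 0x0F; }
    else if ((c & 0xF8) == 0xF0) { extra = 3; cp = c & 0x07; }
    else return 0;                      /* stray continuation or bad lead byte */
    if (extra >= len - i) return 0;     /* sequence runs past the end */
    for (size_t k = 1; k <= extra; k++) {
        if (!is_cont(s[i + k])) return 0;
        cp = (cp << 6) | (unsigned long)(s[i + k] & 0x3F);
    }
    *code = cp;
    return i + extra + 1;
}

/*
 * One iteration step. 'n' is the control value: the 1-based position of
 * the previous character, or 0 on the first call. In 0-based terms, s[n]
 * is the byte right after the previous character's lead byte.
 * Returns 1 and fills *pos (1-based) and *code on success, 0 at the end
 * of the string, and -1 on an invalid sequence.
 */
static int iter_step(const char *str, size_t len, size_t n,
                     size_t *pos, unsigned long *code) {
    const unsigned char *s = (const unsigned char *)str;
    /* A string that starts with a continuation byte is invalid. */
    if (n == 0 && len > 0 && is_cont(s[0])) return -1;
    /* Skip the continuation bytes of the previous character. */
    while (n < len && is_cont(s[n])) n++;
    if (n >= len) return 0;
    size_t next = decode_one(s, len, n, code);
    /* Assumed extra check: a continuation byte directly after a complete
       sequence belongs to no character, so report it here instead of
       letting the next call skip over it. */
    if (next == 0 || (next < len && is_cont(s[next]))) return -1;
    *pos = n + 1;   /* position of the current character, not the next */
    return 1;
}
```

With the string from the report, 'in\xbfvalid', the first step yields
position 1 and code 105, and the second step (at 'n') returns -1 because
the byte that follows is a stray continuation byte. For a valid string
the caller still sees the position of each character as it is traversed.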