- Subject: Re: UTF-8 validation
- From: Jay Carlson <nop@...>
- Date: Wed, 9 Dec 2015 21:56:56 -0500
On 2015-12-09, at 9:32 PM, Jonathan Goble <jcgoble3@gmail.com> wrote:
>
> On Wed, Dec 9, 2015 at 9:29 PM, Jay Carlson <nop@nop.com> wrote:
>> Given a string where is_utf8(s) is false, it might be nice to be able to find the byte offset of the first non-UTF-8 sequence.
>
> utf8.len() already does this.
utf8.len doesn't match the standard definition of UTF-8: it accepts some sequences that RFC 3629 rules out. Consider this example of an invalid sequence, taken from https://tools.ietf.org/html/rfc3629#section-4:
Lua 5.3.1 Copyright (C) 1994-2015 Lua.org, PUC-Rio
> utf8.len("\xED\xA1\x8C")
1
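Decoding those three bytes by hand shows why the sequence is invalid: it is the three-byte form of a UTF-16 surrogate, which UTF-8 is not allowed to encode. A small sketch, assuming Lua 5.3 bit operators; decode3 is a made-up helper name, not part of the utf8 library:

```lua
-- Hypothetical helper for illustration only: decode a 3-byte sequence of the
-- form 1110xxxx 10xxxxxx 10xxxxxx with no validity checks at all.
local function decode3(s)
  local b1, b2, b3 = s:byte(1, 3)
  return ((b1 & 0x0F) << 12) | ((b2 & 0x3F) << 6) | (b3 & 0x3F)
end

local cp = decode3("\xED\xA1\x8C")
print(string.format("U+%04X", cp))      -- U+D84C
print(cp >= 0xD800 and cp <= 0xDFFF)    -- true: inside the surrogate range
```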
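The production it's supposed to match is below; the %xED branch of UTF8-3 limits the second byte to %x80-9F, so a conforming validator rejects \xED\xA1\x8C: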
> UTF8-octets = *( UTF8-char )
> UTF8-char   = UTF8-1 / UTF8-2 / UTF8-3 / UTF8-4
> UTF8-1      = %x00-7F
> UTF8-2      = %xC2-DF UTF8-tail
> UTF8-3      = %xE0 %xA0-BF UTF8-tail / %xE1-EC 2( UTF8-tail ) /
>               %xED %x80-9F UTF8-tail / %xEE-EF 2( UTF8-tail )
> UTF8-4      = %xF0 %x90-BF 2( UTF8-tail ) / %xF1-F3 3( UTF8-tail ) /
>               %xF4 %x80-8F 2( UTF8-tail )
> UTF8-tail   = %x80-BF
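
A rough sketch of a checker that follows that production directly and reports the byte offset of the first ill-formed sequence. The name first_invalid_utf8 and its return convention (nil when the whole string is valid) are my own assumptions, not anything in the utf8 library:

```lua
-- Hypothetical helper: returns nil if s matches UTF8-octets from RFC 3629,
-- otherwise the 1-based byte offset where the first ill-formed sequence starts.
local function first_invalid_utf8(s)
  local i, n = 1, #s
  while i <= n do
    local b = s:byte(i)
    local len, lo, hi   -- sequence length; allowed range for the second byte
    if b <= 0x7F then len = 1                                          -- UTF8-1
    elseif b >= 0xC2 and b <= 0xDF then len, lo, hi = 2, 0x80, 0xBF    -- UTF8-2
    elseif b == 0xE0 then len, lo, hi = 3, 0xA0, 0xBF                  -- no overlongs
    elseif b == 0xED then len, lo, hi = 3, 0x80, 0x9F                  -- no surrogates
    elseif b >= 0xE1 and b <= 0xEF then len, lo, hi = 3, 0x80, 0xBF    -- E1-EC, EE-EF
    elseif b == 0xF0 then len, lo, hi = 4, 0x90, 0xBF                  -- no overlongs
    elseif b >= 0xF1 and b <= 0xF3 then len, lo, hi = 4, 0x80, 0xBF
    elseif b == 0xF4 then len, lo, hi = 4, 0x80, 0x8F                  -- <= U+10FFFF
    else return i end   -- 0x80-0xC1 and 0xF5-0xFF never start a sequence
    for k = 1, len - 1 do
      local c = s:byte(i + k)
      if not c then return i end                 -- string ends mid-sequence
      if k == 1 then
        if c < lo or c > hi then return i end    -- restricted second byte
      elseif c < 0x80 or c > 0xBF then
        return i                                 -- plain UTF8-tail
      end
    end
    i = i + len
  end
  return nil
end

print(first_invalid_utf8("abc\xED\xA1\x8C"))  -- 4
print(first_invalid_utf8("h\xC3\xA9llo"))     -- nil
```

The only bytes that need special handling are the second bytes after E0, ED, F0 and F4; every other continuation byte is a plain UTF8-tail, so one table of (length, low, high) per lead byte covers the whole grammar.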
Jay