- Subject: Re: UTF-8 validation
- From: "Cezary H. Noweta" <chn@...>
- Date: Thu, 10 Dec 2015 07:22:04 +0100
On 2015-12-10 04:49, Cezary H. Noweta wrote:
>> On Wed, Dec 9, 2015 at 9:29 PM, Jay Carlson <nop@nop.com> wrote:
>> Given a string where is_utf8(s) is false, it might be nice to be
>> able to find the byte offset of the first non-UTF-8 sequence.
>
> For Jay's idea: (1) let utf8.validate() return:
>   str false  --> if there was no ill-formed sequence
>                  (note that the number 0 is not false)
>   str number --> if there was an ill-formed sequence;
>                  number is the position of the first invalid byte
>                  (in the source str)
> and/or (2) add a third parameter, stoponerror (or flags packed into one
> integer parameter?): if somebody wants to write their own make_safe, it
> is a good idea to get the first well-formed part of the string instead
> of the whole validated string.
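
For comparison, the byte offset Jay asks for can already be obtained from the stock utf8 library in Lua 5.3 and later, because utf8.len returns a false value plus the position of the first invalid byte. The helper name first_invalid below is made up for illustration, and utf8.len's idea of validity is not necessarily the same as that of the utf8.validate() discussed here (it also differs between Lua versions), so treat this as a rough sketch:

```lua
-- Sketch: find the byte offset of the first invalid sequence using only
-- the stock utf8 library (Lua 5.3+). first_invalid is a hypothetical name.
local function first_invalid(s)
  local n, pos = utf8.len(s)   -- on invalid input: false value, byte position
  if n then return nil end     -- whole string decoded cleanly
  return pos
end

print(first_invalid('abcdef'))          --> nil
print(first_invalid('abc\xC0\x80def'))  --> 4
```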
OK - now the function returns:
  str false  --> if everything is OK
  str number --> if there was an error;
                 number is the position in the source string
                 of the first invalid character
There is a third optional parameter, stoponerror, which (when true) makes
the function stop as soon as an invalid character is encountered. (The
returned string then holds the valid characters, including re-encoded NULs
and surrogates if any, up to that point.)
utf8.validate(s [, allowlongnul [, allowsurrogates [, stoponerror]]])
utf8.validate('abc\xC0\x80def'); --> 'abcdef' 4
utf8.validate('abc\xC0\x80def', true); --> 'abc\x00def' false
utf8.validate('abc\xC0\x80def', false, false, true); --> 'abc' 4
utf8.validate('\xED\xA0\x80\xED\xB0\x80', false, true);
--> '\xF0\x90\x80\x80' false
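
The behaviour shown in these examples can be approximated in pure Lua. The following is only a sketch for Lua 5.3+ and is not the code in lutf8lib.c; the helper name decode is made up. It also assumes that an invalid byte is simply dropped from the output and that only a complete high+low surrogate pair is re-encoded, which the post does not spell out:

```lua
-- Sketch only: mimics the behaviour described above, not lutf8lib.c itself.
-- Requires Lua 5.3+ (bitwise operators, utf8.char).

-- Decode the sequence starting at byte i, without range checks.
-- Returns the code point, its length in bytes, and the smallest code point
-- that legitimately needs that length; returns nil on a bad lead byte or
-- on a missing or bad continuation byte.
local function decode(s, i)
  local b = s:byte(i)
  if not b then return nil end
  if b < 0x80 then return b, 1, 0 end
  local n, cp, min
  if b >= 0xC0 and b < 0xE0 then n, cp, min = 2, b & 0x1F, 0x80
  elseif b >= 0xE0 and b < 0xF0 then n, cp, min = 3, b & 0x0F, 0x800
  elseif b >= 0xF0 and b < 0xF8 then n, cp, min = 4, b & 0x07, 0x10000
  else return nil end
  for k = 1, n - 1 do
    local c = s:byte(i + k)
    if not c or (c & 0xC0) ~= 0x80 then return nil end
    cp = (cp << 6) | (c & 0x3F)
  end
  return cp, n, min
end

local function validate(s, allowlongnul, allowsurrogates, stoponerror)
  local out, firstbad, i = {}, false, 1
  while i <= #s do
    local cp, n, min = decode(s, i)
    local piece   -- text to emit if the sequence at i is accepted
    if cp then
      if cp < min then
        -- Overlong form: only C0 80 (overlong NUL) may be accepted.
        if allowlongnul and cp == 0 and n == 2 then piece = "\0" end
      elseif cp >= 0xD800 and cp <= 0xDFFF then
        -- Assumption: only a high surrogate followed by a low one is accepted.
        local lo, ln = decode(s, i + n)
        if allowsurrogates and cp <= 0xDBFF and lo and ln == 3
           and lo >= 0xDC00 and lo <= 0xDFFF then
          piece = utf8.char(0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00))
          n = n + ln
        end
      elseif cp <= 0x10FFFF then
        piece = s:sub(i, i + n - 1)
      end
    end
    if piece then
      out[#out + 1] = piece
      i = i + n
    else
      firstbad = firstbad or i      -- remember only the first error
      if stoponerror then break end
      i = i + 1                     -- drop one byte and resynchronise
    end
  end
  return table.concat(out), firstbad
end
```

Advancing one byte at a time after an error is enough to drop a whole bad sequence, because any leftover continuation bytes are themselves invalid as lead bytes.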
http://lua.chncc.eu/utf8/201512100653/lutf8lib.c
MD5 33e229ccb8199ece764bf6eef3f8c00a
-- best regards
Cezary H. Noweta