Re: UTF-8 validation

lua-l archive


Subject: Re: UTF-8 validation
From: "Cezary H. Noweta" <chn@...>
Date: Thu, 10 Dec 2015 07:22:04 +0100

On 2015-12-10 04:49, Cezary H. Noweta wrote:

On Wed, Dec 9, 2015 at 9:29 PM, Jay Carlson <nop@nop.com> wrote:

Given a string where is_utf8(s) is false, it might be nice to be
able to find the byte offset of the first non-UTF-8 sequence.

For Jay's idea: (1) let utf8.validate() return:

str false --> if there was no ill-formed sequence
              (false rather than 0, because the number 0 is not false in Lua)

str number --> position of the first invalid byte (in the source string)
               if there was an ill-formed sequence

and/or (2) a third parameter, stoponerror (flags in one integer parameter?):
if somebody wants to write their own make_safe, then it is a good idea to
have the first well-formed part of the string instead of the whole
validated string.


OK - now the function returns:

str false --> if everything is OK

str number --> if there was an error;
               number is position in the source string
               of invalid character

There is a third optional parameter, stoponerror, which (when true) causes
the function to exit as soon as an invalid char has been encountered.
(The returned string contains the valid characters up to that point,
including reencoded NULs and surrogates, if any.)
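
To make the meaning of the returned position concrete, here is a minimal C sketch of a strict scanner that reports where the first ill-formed sequence starts. It is not the code from lutf8lib.c: the name first_invalid_utf8 is hypothetical, the allowlongnul and allowsurrogates options are not modelled (overlong NULs and surrogates are simply rejected), and it returns 0 where the Lua function would return false.

```c
#include <stddef.h>

/* Hypothetical helper, not taken from lutf8lib.c.
 * Returns the 1-based position of the byte that starts the first
 * ill-formed sequence, or 0 if the whole buffer is well-formed UTF-8. */
size_t first_invalid_utf8(const unsigned char *s, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char c = s[i];
        size_t n;                            /* continuation bytes expected */
        unsigned char lo = 0x80, hi = 0xBF;  /* allowed range of the 2nd byte */

        if (c < 0x80) {                      /* ASCII */
            i++;
            continue;
        } else if (c >= 0xC2 && c <= 0xDF) {
            n = 1;
        } else if (c >= 0xE0 && c <= 0xEF) {
            n = 2;
            if (c == 0xE0) lo = 0xA0;        /* reject overlong 3-byte forms */
            else if (c == 0xED) hi = 0x9F;   /* reject surrogates */
        } else if (c >= 0xF0 && c <= 0xF4) {
            n = 3;
            if (c == 0xF0) lo = 0x90;        /* reject overlong 4-byte forms */
            else if (c == 0xF4) hi = 0x8F;   /* reject values above U+10FFFF */
        } else {
            return i + 1;  /* 0x80..0xC1 and 0xF5..0xFF cannot start a sequence */
        }

        if (len - i <= n) return i + 1;      /* sequence is truncated */
        if (s[i + 1] < lo || s[i + 1] > hi) return i + 1;
        for (size_t k = 2; k <= n; k++)
            if (s[i + k] < 0x80 || s[i + k] > 0xBF) return i + 1;
        i += n + 1;
    }
    return 0;
}
```

For the bytes of 'abc\xC0\x80def' this sketch yields 4, because 0xC0 can never start a well-formed sequence. In C source the literal has to be split as "abc\xC0\x80" "def", since a hex escape would otherwise swallow the following hex digits.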


utf8.validate(s [, allowlongnul [, allowsurrogates [, stoponerror]]])

utf8.validate('abc\xC0\x80def'); --> 'abcdef' 4

utf8.validate('abc\xC0\x80def', true); --> 'abc\x00def' false

utf8.validate('abc\xC0\x80def', false, false, true); --> 'abc' 4

utf8.validate('\xED\xA0\x80\xED\xB0\x80', false, true);
  --> '\xF0\x90\x80\x80' false
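
The last example can be checked by hand: '\xED\xA0\x80' decodes to the high surrogate U+D800, '\xED\xB0\x80' to the low surrogate U+DC00, and the pair combines to U+10000, which encodes as '\xF0\x90\x80\x80'. The arithmetic is sketched below in C. The function names are hypothetical and the code only illustrates the re-encoding step, assuming the input has already been recognised as a valid high/low pair. It is not the implementation in lutf8lib.c.

```c
#include <stdint.h>

/* Decode a 3-byte UTF-8 sequence without any validity checks. */
uint32_t decode_utf8_3(const unsigned char b[3])
{
    return ((uint32_t)(b[0] & 0x0F) << 12)
         | ((uint32_t)(b[1] & 0x3F) << 6)
         |  (uint32_t)(b[2] & 0x3F);
}

/* Combine a high surrogate (0xD800..0xDBFF) and a low surrogate
 * (0xDC00..0xDFFF) into one supplementary code point. */
uint32_t combine_surrogates(uint32_t high, uint32_t low)
{
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
}

/* Encode a code point in U+10000..U+10FFFF as four UTF-8 bytes. */
void encode_utf8_4(uint32_t cp, unsigned char out[4])
{
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
}
```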

http://lua.chncc.eu/utf8/201512100653/lutf8lib.c

MD5 33e229ccb8199ece764bf6eef3f8c00a

-- best regards

Cezary H. Noweta

