- Subject: Re: UTF-8 validation
- From: "Cezary H. Noweta" <chn@...>
- Date: Thu, 10 Dec 2015 04:49:29 +0100
On 2015-12-10 03:56, Jay Carlson wrote:
> On 2015-12-09, at 9:32 PM, Jonathan Goble <jcgoble3@gmail.com> wrote:
>> On Wed, Dec 9, 2015 at 9:29 PM, Jay Carlson <nop@nop.com> wrote:
>>> Given a string where is_utf8(s) is false, it might be nice to be able to find the byte offset of the first non-UTF-8 sequence.
>> utf8.len() already does this.
> utf8.len doesn't match the standard definition of UTF-8. Consider this example of an invalid sequence, taken from https://tools.ietf.org/html/rfc3629#section-4 :
utf8.len() should stay as it is: a fast version for well-formed strings.
There is no need to use heavy validators if we know that a given string
is valid (or only "lightly" ill-formed). utf8.len() correctly returns an
error if it does not know what to do with the supplied data. I think the
returned error state is intended to say "hey, I don't know what to do
with your data", rather than to check the validity of UTF-8 strings.
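
To make the current behaviour concrete, here is a minimal sketch for Lua 5.3 or later. The helper name first_invalid_offset is hypothetical. It only reports what utf8.len() itself rejects, so it is not a strict RFC 3629 check, and exactly which sequences are rejected depends on the Lua version.

```lua
-- Hypothetical helper: byte offset of the first sequence that utf8.len()
-- rejects, or nil when utf8.len() accepts the whole string.
local function first_invalid_offset(s)
  -- On success utf8.len returns the character count; on failure it
  -- returns a false value plus the position of the first invalid byte.
  local n, pos = utf8.len(s)
  if n then
    return nil
  end
  return pos
end

print(first_invalid_offset("abc"))        -- accepted by utf8.len
print(first_invalid_offset("ab\xFFcd"))   -- 0xFF can never occur in UTF-8
```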
The scenario should be:
1) Make sure the string is valid.
2) Do something fast to a valid string.
3) Do something else fast to a valid string.
...
If there are no intervening places where the string could be invalidated,
it is a waste of time to validate the string each time.
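
The scenario above could be sketched roughly as follows (Lua 5.3 or later). The function name process is hypothetical, and utf8.len() stands in for the "real" validator of step 1 only because no stricter one exists in the standard library. Steps 2 and 3 use utf8.codes() and utf8.offset() as examples of operations that assume well-formed input.

```lua
-- Hypothetical pipeline: validate once, then run fast operations that
-- assume a well-formed string.
local function process(s)
  -- Step 1: validate once (utf8.len used here as a stand-in validator).
  local n, bad = utf8.len(s)
  if not n then
    return nil, bad            -- position of the first rejected byte
  end

  -- Step 2: a fast operation on the now-trusted string:
  -- collect its code points.
  local codepoints = {}
  for _, c in utf8.codes(s) do
    codepoints[#codepoints + 1] = c
  end

  -- Step 3: another fast operation: byte position where the last
  -- character starts (nil for an empty string).
  local last_start = (n > 0) and utf8.offset(s, -1) or nil

  return n, codepoints, last_start
end
```

Nothing between step 1 and step 3 can modify s (Lua strings are immutable), so validating again before each step would add cost without adding safety.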
For Jay's idea: (1) let utf8.validate() return:
str, false --> if nothing was ill-formed (false rather than 0, since the
number 0 is not false in Lua)
str, number --> if something was ill-formed; number is the position of
the first invalid byte (in the source string)
and/or (2) take a third parameter, stoponerror (flags in one integer
parameter?): if somebody wants to write their own make_safe, then it is
a good idea to get the first well-formed part of the string instead of
a whole validated string.
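
A rough pure-Lua sketch of these return conventions, for Lua 5.3 or later. Everything here is an assumption made for illustration: the name validate, passing stoponerror as the second argument (a simplification of the parameter position discussed above), replacing each rejected byte with U+FFFD in the non-stopping mode, and using utf8.len() as the well-formedness test, which is not a strict RFC 3629 check.

```lua
-- Assumption: rejected bytes are replaced by U+FFFD REPLACEMENT CHARACTER.
local REPLACEMENT = "\u{FFFD}"

-- Hypothetical validate(s, stoponerror):
--   returns str, false   when nothing was rejected
--   returns str, number  otherwise; number is the position (in s) of the
--                        first rejected byte
-- With stoponerror set, str is only the leading well-formed part of s.
local function validate(s, stoponerror)
  local n, bad = utf8.len(s)
  if n then
    return s, false
  end

  local first_bad = bad
  if stoponerror then
    return s:sub(1, first_bad - 1), first_bad
  end

  -- Copy accepted runs and substitute one replacement per rejected byte.
  local parts, start = {}, 1
  while bad do
    parts[#parts + 1] = s:sub(start, bad - 1)
    parts[#parts + 1] = REPLACEMENT
    start = bad + 1
    -- On success utf8.len returns one value, so bad becomes nil here.
    n, bad = utf8.len(s, start)
  end
  parts[#parts + 1] = s:sub(start)
  return table.concat(parts), first_bad
end
```

With the stopping mode, a caller writing their own make_safe gets the trusted prefix plus the position to resume from, and can choose any repair policy for the remainder.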
???
-- best regards
Cezary H. Noweta