- Subject: Re: UTF-8 validation
- From: "Cezary H. Noweta" <chn@...>
- Date: Thu, 10 Dec 2015 00:32:42 +0100
On 2015-12-09 23:58, Coda Highland wrote:
> utf8.len() will return false and the position of the first invalid
> byte for an invalid UTF-8 string.
Indeed, but my function's purpose is not to test whether a string is
valid. It is to perform the following flow in one simple step:
[unknown string] => [black box] => [valid string]
This follows a Unicode recommendation. Afterwards I know that the
string contains no 4/6-byte backslashes or quotes usable for SQL
injection, nor other fancy pitfalls.
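As a minimal sketch of such a black box (this is not my actual utf8.validate; the names sanitize_utf8 and sequence_length are made up for illustration), the following plain Lua function copies every well-formed sequence and replaces each offending byte with U+FFFD. It assumes the well-formed byte ranges from the Unicode Standard (Table 3-7), so non-shortest forms, surrogates and 5/6-byte sequences are all rejected:

```lua
-- Hypothetical sketch, not the real utf8.validate.
local REPLACEMENT = "\239\191\189" -- U+FFFD encoded as UTF-8 (EF BF BD)

-- Return the length of the well-formed sequence starting at byte i,
-- or nil if the bytes at i do not start a well-formed sequence.
local function sequence_length(s, i)
  local b1 = s:byte(i)
  if b1 < 0x80 then return 1 end
  local n
  local lo, hi = 0x80, 0xBF -- allowed range for the second byte
  if b1 >= 0xC2 and b1 <= 0xDF then
    n = 2
  elseif b1 >= 0xE0 and b1 <= 0xEF then
    n = 3
    if b1 == 0xE0 then lo = 0xA0 end -- excludes non-shortest forms
    if b1 == 0xED then hi = 0x9F end -- excludes surrogates
  elseif b1 >= 0xF0 and b1 <= 0xF4 then
    n = 4
    if b1 == 0xF0 then lo = 0x90 end -- excludes non-shortest forms
    if b1 == 0xF4 then hi = 0x8F end -- excludes values above U+10FFFF
  else
    return nil -- stray continuation byte, C0/C1, or F5..FF
  end
  for k = 1, n - 1 do
    local b = s:byte(i + k)
    if not b or b < lo or b > hi then return nil end
    lo, hi = 0x80, 0xBF -- later bytes are ordinary continuation bytes
  end
  return n
end

-- [unknown string] => [black box] => [valid string]
local function sanitize_utf8(s)
  local out, i = {}, 1
  while i <= #s do
    local n = sequence_length(s, i)
    if n then
      out[#out + 1] = s:sub(i, i + n - 1)
      i = i + n
    else
      out[#out + 1] = REPLACEMENT
      i = i + 1
    end
  end
  return table.concat(out)
end
```

Replacing one byte at a time is only one possible policy; the point is that the result is always valid, whatever came in.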
Today, non-shortest forms are very dangerous. Lua's utf8_decode is
susceptible to them (there is no need to correct this as long as the
string is valid). UTF-8's compatibility with ASCII allows strings to be
treated as plain ASCII ones; this is done frequently and can be very
dangerous.
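To illustrate why non-shortest forms matter, here is a small sketch with a deliberately naive 2-byte decoder (naive_decode2 is a hypothetical helper, not Lua's utf8_decode). Because it never checks that the result really needs two bytes, the sequences C0 A7 and C1 9C decode to a quote and a backslash, although a byte-wise ASCII filter would never see those characters in the input:

```lua
-- Hypothetical naive decoder for a 2-byte sequence. It masks the
-- payload bits and never applies the shortest-form rule.
local function naive_decode2(s)
  local b1, b2 = s:byte(1, 2)
  return (b1 % 0x20) * 0x40 + (b2 % 0x40)
end

local quote = naive_decode2("\192\167")     -- bytes C0 A7 give 0x27, the apostrophe
local backslash = naive_decode2("\193\156") -- bytes C1 9C give 0x5C, the backslash
print(string.char(quote), string.char(backslash))
```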
The first thing to do with an unknown string (right after its length is
determined) is to validate it. Once a string has been processed by my
utf8.validate, you can apply less secure but very efficient functions
(such as the utf8_decode mentioned above).
-- best regards
Cezary H. Noweta