- Subject: Re: xml/rapidxml UTF-8 validity assurance (was: [ANN] lub, lut, xml, yaml, dub and osc for Lua 5.3)
- From: Coda Highland <chighland@...>
- Date: Mon, 11 May 2015 11:54:50 -0700
On Mon, May 11, 2015 at 11:34 AM, Tim Hill <drtimhill@gmail.com> wrote:
>
>> On May 11, 2015, at 10:53 AM, Jay Carlson <nop@nop.com> wrote:
>>
>> On May 11, 2015, at 7:47 AM, Gaspard Bucher <gaspard@teti.ch> wrote:
>>
>>> xml - very fast xml parser
>>> http://github.com/lubyk/xml
>>
>> Glancing through the source code, I don't see code for rejection of non-UTF-8 sequences when parsing in UTF-8 mode. This is important to some people in Lua, since a lot of UTF-8-related security faults just go away if invalid byte sequences are rejected on input; Prosody is a good example of engineering for this.
>>
>
> Which would you reject?
> - UTF-8 has several non-canonical ways to encode a value (e.g. by using more bytes than needed). I’ve seen (bad) encoders that emit these. Reject/accept?
Default reject (as it's more likely to indicate a non-UTF-8 file than
a bad encoder), provide a configuration option to accept (for the use
cases where you know you've got a busted encoder in the pipeline).
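
As a rough illustration of that default, here is a minimal Python sketch
of a decoder that rejects overlong (non-shortest-form) sequences unless
told otherwise. The function name decode_utf8 and the allow_overlong flag
are made up for this example; they are not part of lubyk/xml, rapidxml,
or any other library mentioned in this thread. Surrogate handling is
deliberately left out here.

```python
def decode_utf8(data, allow_overlong=False):
    """Decode bytes into a list of code points; raise ValueError on bad input.

    Hypothetical helper for illustration only. Surrogate code points are
    passed through unchanged; this sketch only deals with overlong forms.
    """
    # Smallest code point that legitimately needs 1, 2, 3, or 4 bytes.
    min_for_len = {1: 0x00, 2: 0x80, 3: 0x800, 4: 0x10000}
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            n, cp = 1, b
        elif 0xC0 <= b < 0xE0:
            n, cp = 2, b & 0x1F
        elif 0xE0 <= b < 0xF0:
            n, cp = 3, b & 0x0F
        elif 0xF0 <= b < 0xF8:
            n, cp = 4, b & 0x07
        else:
            raise ValueError("invalid lead byte 0x%02X at offset %d" % (b, i))
        if i + n > len(data):
            raise ValueError("truncated sequence at offset %d" % i)
        for c in data[i + 1:i + n]:
            if c & 0xC0 != 0x80:
                raise ValueError("bad continuation byte in sequence at offset %d" % i)
            cp = (cp << 6) | (c & 0x3F)
        if cp > 0x10FFFF:
            raise ValueError("code point out of range at offset %d" % i)
        # An overlong form encodes a value in more bytes than it needs.
        if cp < min_for_len[n] and not allow_overlong:
            raise ValueError("overlong encoding at offset %d" % i)
        out.append(cp)
        i += n
    return out
```

With the default, the classic overlong "/" (bytes C0 AF) is refused;
passing allow_overlong=True decodes it to U+002F for the busted-encoder
case.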
> - UTF-8 is sometimes used to encode UTF-16 values (such as BOM), some of which are now accepted. Reject/accept?
ZWNBSP (U+FEFF, that is, the BOM) is a perfectly legit character in
UTF-8. Accept it. I get pissed at decoders that choke on it.
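
For what it's worth, here is how that looks with Python's standard
codecs, as a small sketch. The read_text wrapper and its strip_bom flag
are names invented for this example; whether a leading BOM should be kept
or dropped after being accepted is a separate choice that I'm only
assuming here.

```python
def read_text(data, strip_bom=True):
    """Decode UTF-8 bytes, accepting a leading BOM instead of choking on it.

    Hypothetical wrapper: "utf-8-sig" drops one leading U+FEFF if present,
    while plain "utf-8" keeps it as an ordinary character.
    """
    codec = "utf-8-sig" if strip_bom else "utf-8"
    return data.decode(codec)


with_bom = b"\xef\xbb\xbf<root/>"
print(repr(read_text(with_bom)))                   # BOM accepted and dropped
print(repr(read_text(with_bom, strip_bom=False)))  # BOM accepted and kept
```

Either way the input is accepted; only a leading U+FEFF is affected, and
one in the middle of the text stays put under both codecs.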
> - UTF-8 can encode high/low UTF-16 surrogate pairs, which should be invalid but could be converted to a codepoint. Reject/accept?
> .. and so on.
UTF-8-encoded surrogates get generated by LOTS of encoders that
otherwise follow spec. Even though it's in violation of the spec,
accept by default. Offer a configuration option to preserve the
characters as encoded instead of combining the surrogates into a
single non-BMP code point for the purpose of analysis.
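
A minimal Python sketch of that behaviour, assuming a made-up function
decode_lenient with a made-up combine_surrogates option (neither exists
in any library discussed here). It leans on Python's "surrogatepass"
error handler to let UTF-8-encoded surrogates through, then optionally
folds each high/low pair into one non-BMP code point.

```python
def decode_lenient(data, combine_surrogates=True):
    """Decode UTF-8 that may contain individually encoded UTF-16 surrogates.

    Hypothetical helper for illustration. With combine_surrogates=False the
    surrogates are preserved exactly as they were encoded.
    """
    # "surrogatepass" accepts the 3-byte encodings of U+D800..U+DFFF.
    text = data.decode("utf-8", errors="surrogatepass")
    if not combine_surrogates:
        return text
    out = []
    i = 0
    while i < len(text):
        hi = ord(text[i])
        if 0xD800 <= hi <= 0xDBFF and i + 1 < len(text):
            lo = ord(text[i + 1])
            if 0xDC00 <= lo <= 0xDFFF:
                # Standard UTF-16 pair arithmetic.
                cp = 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)
                out.append(chr(cp))
                i += 2
                continue
        # Anything else, including an unpaired surrogate, is kept as-is.
        out.append(text[i])
        i += 1
    return "".join(out)


pair = b"\xed\xa0\xbd\xed\xb8\x80"  # U+D83D U+DE00, each encoded on its own
print(len(decode_lenient(pair)))                            # one code point
print(len(decode_lenient(pair, combine_surrogates=False)))  # two surrogates
```

What to do with an unpaired surrogate is left open in this sketch; it is
simply passed through.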
/s/ Adam
- References:
- [ANN] lub, lut, xml, yaml, dub and osc for Lua 5.3, Gaspard Bucher
- xml/rapidxml UTF-8 validity assurance (was: [ANN] lub, lut, xml, yaml, dub and osc for Lua 5.3), Jay Carlson
- Re: xml/rapidxml UTF-8 validity assurance (was: [ANN] lub, lut, xml, yaml, dub and osc for Lua 5.3), Tim Hill