Re: utf8.len and BOM

lua-l archive


Subject: Re: utf8.len and BOM
From: Rob Kendrick <rjek@...>
Date: Fri, 16 Jan 2015 17:21:20 +0000

On Fri, Jan 16, 2015 at 09:17:08AM -0800, Coda Highland wrote:
> On Fri, Jan 16, 2015 at 4:53 AM, Rob Kendrick <rjek@rjek.com> wrote:
> > On Fri, Jan 16, 2015 at 12:11:41PM +0000, Aapo Talvensaari wrote:
> >> Is it by design that utf8.len counts the BOM toward the length?
> >>
> >> That is, utf8.len("\xEF\xBB\xBFa") returns 2 instead of 1?
> >
> > Given UTF8 has only one valid "byte order", it makes no sense to ever
> > include a byte order marker in a UTF8 document.
> >
> 
> Sure it does -- the UTF-8 BOM is used (and aggressively promoted by
> Microsoft) as a magic number to identify the contents of the file as
> UTF-8 text. 

Lots of things aggressively promoted by Microsoft are mistakes.

No BOM -> content is UTF-8 encoded.


> The XML spec even explicitly supports this (although many
> XML parsers do not).

Probably because they have to deal with people using Microsoft text
editors, not because it's a recommendation.
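
For anyone who does have to handle such files, here is a minimal sketch of
dropping a leading BOM before counting. It assumes Lua 5.3's utf8 library;
strip_bom is a hypothetical helper of my own naming, not something the
library provides.

```lua
-- Requires Lua 5.3 or later for the utf8 library.
local BOM = "\xEF\xBB\xBF"  -- U+FEFF encoded as UTF-8

-- strip_bom is a hypothetical helper, not part of the utf8 library.
-- It returns s without a leading UTF-8 BOM, or s unchanged if there is none.
local function strip_bom(s)
  if s:sub(1, #BOM) == BOM then
    return s:sub(#BOM + 1)
  end
  return s
end

local s = BOM .. "a"
print(utf8.len(s))             -- 2: the BOM is counted as one code point
print(utf8.len(strip_bom(s)))  -- 1
```

This leaves utf8.len itself alone and puts the decision about the BOM in
the caller's hands.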

B.

Follow-Ups:
- Re: utf8.len and BOM, Coda Highland
- Re: utf8.len and BOM, Luiz Henrique de Figueiredo

References:
- utf8.len and BOM, Aapo Talvensaari
- Re: utf8.len and BOM, Rob Kendrick
- Re: utf8.len and BOM, Coda Highland
