- Subject: Re: ignoring BOM
- From: Miles Bader <miles@...>
- Date: Mon, 01 Jun 2009 17:19:15 +0900
Javier Bezos <noreply@tex-tipografia.com> writes:
> Silly or not, it's a valid option in a UTF-8 file, according to
> the Unicode standard:
>
> In UTF-8, the BOM corresponds to the byte sequence <EF₁₆ BB₁₆ BF₁₆>.
> Although there are never any questions of byte order with UTF-8 text,
> this sequence can serve as signature for UTF-8 encoded text where the
> character set is unmarked.
UTF-8 was carefully designed so that it stands a good chance of working
properly with non-UTF-8-aware applications, as long as one restricts the
use of the non-ASCII subset to comments, strings, etc., and the
application treats characters with the 8th bit set in such contexts as
opaque data. This includes many compilers etc. (and probably Lua).
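To illustrate why that works, here is a rough Python sketch (the sample
string and variable names are made up for the example). Every byte of a
non-ASCII character in UTF-8 has the 8th bit set, so an ASCII delimiter
such as a double quote can never show up in the middle of one:

```python
# Hypothetical one-line source file with non-ASCII text inside a string.
sample = 'print("日本語")'
encoded = sample.encode("utf-8")

# Keep only the bytes a 7-bit-ASCII-minded tool would assign meaning to.
ascii_only = bytes(b for b in encoded if b < 0x80)

# Bytes with the 8th bit set: the payload such a tool can pass through
# untouched as opaque data.
opaque = bytes(b for b in encoded if b >= 0x80)

print(ascii_only)   # the syntax the tool actually has to understand
print(len(opaque))  # number of opaque bytes (3 per character here)
```

Dropping the high-bit bytes leaves the ASCII syntax intact, which is all
a byte-oriented lexer needs to get right.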
However, an MS-style BOM occurs at the beginning of the file, not inside
a comment or string where the language can blithely pass it through.
As far as Lua (or any other application that doesn't treat UTF-8 or BOM
specially) is concerned, there are 3 garbage characters at the beginning
of your file.
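One possible workaround is to strip those three bytes before the file
reaches the interpreter. A minimal Python sketch, assuming the file
contents are already in memory as bytes (strip_utf8_bom is a name
invented for this example, not part of Lua or of any tool mentioned
here):

```python
import codecs

def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM (EF BB BF), if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data

# Hypothetical file contents as an editor that adds a BOM would save them.
source = codecs.BOM_UTF8 + b'print("hello")\n'
clean = strip_utf8_bom(source)

print(source[:3])  # the three bytes a BOM-unaware tool sees first
print(clean)       # what should be handed to the interpreter instead
```

When decoding to text anyway, Python's "utf-8-sig" codec does the same
job: it drops a leading BOM if present and otherwise behaves like
plain "utf-8".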
Of course, in typical MS fashion, MS editors and tools make it hard
to avoid adding the BOM, or sometimes require the BOM to recognize a file
as UTF-8 (my own miserable experience is with the Japanese version of
Visual Studio...) -- thus breaking the carefully designed attempt at
compatibility that UTF-8 normally provides.
-Miles
--
Dictionary, n. A malevolent literary device for cramping the growth of
a language and making it hard and inelastic. This dictionary, however,
is a most useful work.