- Subject: Re: question about Unicode
- From: Rici Lake <lua@...>
- Date: Thu, 7 Dec 2006 16:40:55 -0500
On 7-Dec-06, at 3:44 PM, Brian Weed wrote:

> Asko Kauppi wrote:
>> But there may be some identifier "stamp" that can be used to know a
>> file is UTF-8, no?
>
> There are two that I know of. I don't know how "standard" they are.
> One is called a BOM Header, which is some binary code in the first 2
> bytes of the "text" file. The other is the occurrence of this text
> "charset=utf-8", anywhere in the file (at least according to the
> editor I use: UltraEdit).
OK, another delve into the intricacies of Unicode. An "encoding form"
is a mapping between sequences of numbers in some word size and Unicode
characters. There are three of these, corresponding to 8-bit, 16-bit,
and 32-bit numbers.
However, serializing a sequence of numbers into a sequence of bytes is
subject to the vagaries of endianness, so Unicode also defines
"encoding schemes", which is a specification of a byte-serialized
string in some encoding form. There are seven character encoding
schemes defined by Unicode (and several others which are in less common
use): UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE and UTF-32LE.
BE and LE refer to endianness; if a string is advertised as being in
UTF-16BE, for example, the 16-bit numbers are unambiguously serialized
in big-endian order (i.e. network byte order).
The two encoding schemes with unspecified endianness, UTF-16 and
UTF-32, *may* start with 0xFEFF, the so-called Byte Order Mark (BOM).
If a stream does start with a BOM, the BOM reveals the endianness of
the encoding and does not form part of the data stream. (If it doesn't
start with a BOM, the data must be serialized in big-endian order.)
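To make the rule concrete, a decoder for the plain UTF-16 scheme might
begin roughly like this (a minimal Lua sketch; the function name and
return convention are just illustrative):

-- Decide the endianness of a byte string advertised as UTF-16
-- (the unspecified-endianness scheme). Returns the endianness and
-- the index at which the data proper starts.
function utf16_endianness(s)
  local b1, b2 = string.byte(s, 1, 2)
  if b1 == 0xFE and b2 == 0xFF then
    return "be", 3       -- BOM present: big-endian, BOM is not data
  elseif b1 == 0xFF and b2 == 0xFE then
    return "le", 3       -- BOM present: little-endian, BOM is not data
  else
    return "be", 1       -- no BOM: big-endian, first bytes are data
  end
end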
U+FEFF is a valid Unicode character, the rather Zen "Zero Width
No-Break Space". If a stream advertised as being UTF-8, UTF-16BE,
UTF-16LE, UTF-32BE, or UTF-32LE starts with U+FEFF, then that character
must be passed on to the application, which will presumably ignore it
since it's hard to know how else to process such a character. (However,
it would form part of a MAC (message authentication code), if you were
constructing one.)
You cannot tell with absolute rigor whether a stream is UTF-16 or
UTF-32 by examining the first bytes, because U+0000 is a legal
character; thus, a stream starting 0x00 0x00 0xFE 0xFF could be
big-endian UTF-16 representing the characters NUL and ZWNBS, or it
could be a big-endian UTF-32 BOM. Similarly, a stream starting 0xFF
0xFE 0x00 0x00 could be a little-endian 16-bit BOM followed by a NUL,
or a little-endian 32-bit BOM.
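A sniffer that looks only at the first four bytes therefore has to
admit both readings; for example (a hypothetical sketch, writing the
bytes above as Lua decimal string escapes):

-- The same four-byte prefixes admit two readings each.
function sniff_prefix(s)
  local p = string.sub(s, 1, 4)
  if p == "\000\000\254\255" then      -- 00 00 FE FF
    return "UTF-32BE BOM, or UTF-16BE NUL followed by ZWNBS"
  elseif p == "\255\254\000\000" then  -- FF FE 00 00
    return "UTF-32LE BOM, or UTF-16LE BOM followed by NUL"
  end
end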
A UTF-8 stream might start with a ZWNBS (a practice the Unicode
Consortium "neither requires nor recommends"), but it would be
interpreted as a ZWNBS (part of the character stream) and not a BOM.
This would be a pretty good indication that the stream was UTF-8
(although it could be the unlikely iso-8859-1 sequence ï»¿).
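In other words, looking for U+FEFF's UTF-8 encoding (the bytes EF BB
BF) at the start of a file is a reasonable heuristic for "this is
probably UTF-8", and nothing more; a sketch:

-- Heuristic only: a leading EF BB BF strongly suggests UTF-8, but the
-- bytes encode a ZWNBS that belongs to the data, not a BOM.
function starts_with_utf8_zwnbs(s)
  return string.sub(s, 1, 3) == "\239\187\191"   -- EF BB BF
end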
All this may seem like sophistry, but it is important if you're doing
digital signatures. Unicode assumes that there will be some external
indication of how a byte stream is to be encoded, such as a MIME header
or an XML declaration.
Far and away the simplest mechanism is to require that character
strings used in data exchange be in UTF-8. I'd certainly be quite happy
for Lua to insist on UTF-8 as a source file encoding format (rejecting
invalid byte sequences); transcoding could be left to utilities like
iconv, which seem to be pretty universally available.
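For what it's worth, "rejecting invalid byte sequences" is cheap to
implement. Here is a simplified well-formedness check in Lua; it
verifies lead bytes and the count and range of continuation bytes, but
not every fine point (e.g. surrogates, some overlong forms, or code
points above U+10FFFF):

-- Simplified UTF-8 well-formedness check: validates lead bytes and the
-- count and range of continuation bytes. It does not catch every
-- ill-formed case (e.g. surrogates or some overlong forms).
function is_well_formed_utf8(s)
  local i, n = 1, string.len(s)
  while i <= n do
    local c = string.byte(s, i)
    local extra
    if c < 0x80 then extra = 0                       -- ASCII
    elseif c >= 0xC2 and c <= 0xDF then extra = 1    -- 2-byte sequence
    elseif c >= 0xE0 and c <= 0xEF then extra = 2    -- 3-byte sequence
    elseif c >= 0xF0 and c <= 0xF4 then extra = 3    -- 4-byte sequence
    else return false end                            -- invalid lead byte
    for j = 1, extra do
      local cc = string.byte(s, i + j)
      if not cc or cc < 0x80 or cc > 0xBF then
        return false                                 -- bad continuation
      end
    end
    i = i + extra + 1
  end
  return true
end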
- References:
- question about Unicode, Roberto Ierusalimschy
- Re: question about Unicode, Matt Campbell
- Re: question about Unicode, Roberto Ierusalimschy
- Re: question about Unicode, David Jones
- Re: question about Unicode, Roberto Ierusalimschy
- Re: question about Unicode, David Given
- Re: question about Unicode, Rici Lake
- Re: question about Unicode, Roberto Ierusalimschy
- Re: Re: question about Unicode, Ken Smith
- Re: question about Unicode, Adrian Perez
- Re: question about Unicode, Asko Kauppi
- Re: question about Unicode, Brian Weed