- Subject: Re: question about Unicode
- From: Glenn Maynard <glenn@...>
- Date: Thu, 7 Dec 2006 16:56:27 -0500
On Thu, Dec 07, 2006 at 09:55:19AM -0500, Rici Lake wrote:
> Yes, I agree with that completely. It would have been better
> to use native-endian UTF-16 as an internal representation, and
> UTF-8 as a transfer encoding, which I believe is what Unicode
> Consortium recommends. UTF-16 uses a maximum of 4 bytes to
> represent any code point, but the vast majority of code points
> actually used fit into 2 bytes.
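To make the quoted numbers concrete: a codepoint in the BMP (U+0000..U+FFFF)
takes one 16-bit unit, two bytes; anything above takes a surrogate pair, four
bytes. A minimal C sketch (the helper name is just illustrative):

    #include <stdint.h>

    /* Bytes UTF-16 needs for one codepoint: 2 in the BMP
     * (U+0000..U+FFFF, ignoring the reserved surrogate range),
     * 4 for anything above, which needs a surrogate pair. */
    static int utf16_bytes(uint32_t cp)
    {
        return (cp <= 0xFFFF) ? 2 : 4;
    }

So U+0041 and U+4E2D each cost two bytes, while something like U+1D11E
costs four.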
UTF-16 is terrible. It combines the annoyances inherent in any Unicode
representation (combining characters mean a single glyph can be
represented by several codepoints) with the annoyances of a wide
representation (incompatible with regular C strings; if the stream
becomes desynchronized, e.g. due to a data error, it'll never resync)
and the annoyances of a multibyte representation (a single codepoint
can take a variable number of data elements, so there's no random
access to codepoints).
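Concretely, even finding the i-th codepoint in a UTF-16 buffer means
scanning from the start. A minimal sketch in C, assuming a buffer of
native-endian uint16_t code units (the function name is just
illustrative):

    #include <stddef.h>
    #include <stdint.h>

    /* Count codepoints in a UTF-16 string of n 16-bit units.
     * A high surrogate (0xD800..0xDBFF) followed by a low surrogate
     * (0xDC00..0xDFFF) encodes one codepoint in two units, so code
     * unit index and codepoint index drift apart and there is no
     * O(1) way to jump to the i-th codepoint. */
    static size_t utf16_codepoint_count(const uint16_t *s, size_t n)
    {
        size_t count = 0;
        for (size_t i = 0; i < n; i++) {
            if (s[i] >= 0xD800 && s[i] <= 0xDBFF &&
                i + 1 < n && s[i+1] >= 0xDC00 && s[i+1] <= 0xDFFF)
                i++;    /* skip the low half of the surrogate pair */
            count++;
        }
        return count;
    }

UTF-8 has the same property, of course; the point is that UTF-16
doesn't escape it.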
UTF-32 at least does away with the last: a single data element (wchar_t)
always represents a single codepoint. That codepoint may not represent
the entire glyph, but that's a separate problem--in UTF-16, you have to
cope both with decoding codepoints and with combining multiple
codepoints into one glyph, two different issues that cause different
problems.
(I suspect that a lot of application-level UTF-16 code simply ignores
surrogate pairs, turning it into UCS-2, though.)
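Handling surrogate pairs correctly isn't much code; the point is that
you have to remember to do it at all. A sketch of the combining step,
assuming the pair has already been validated (the function name is just
illustrative):

    #include <stdint.h>

    /* Combine a high/low surrogate pair into the codepoint it encodes.
     * Code that skips this and treats every 16-bit unit as a character
     * is really handling UCS-2, not UTF-16. */
    static uint32_t utf16_decode_pair(uint16_t hi, uint16_t lo)
    {
        return 0x10000u
             + (((uint32_t)(hi - 0xD800u) << 10)
             |   (uint32_t)(lo - 0xDC00u));
    }

For example, U+1D11E (MUSICAL SYMBOL G CLEF) is stored as the pair
0xD834 0xDD1E, and utf16_decode_pair(0xD834, 0xDD1E) yields 0x1D11E.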
--
Glenn Maynard