- Subject: Re: question about Unicode
- From: Glenn Maynard <glenn@...>
- Date: Thu, 7 Dec 2006 17:25:22 -0500
On Thu, Dec 07, 2006 at 03:44:05PM -0500, Brian Weed wrote:
> Asko Kauppi wrote:
> >But there may be some identifier "stamp" that can be used to know a
> >file is UTF-8, no?
> There are two that I know of. I don't know how "standard" they are.
> One is called a BOM Header, which is some binary code in the first 2
> bytes of the "text" file.
Three: 0xEF 0xBB 0xBF (U+FEFF encoded as UTF-8). Don't use that unless
you're writing Windows-specific stuff and you really need to be
compatible with other Windows applications that expect it. It's not
"binary" any more than any other UTF-8 character, but text file
encodings do not have headers! (And if you--the reader, not Brian
Weed--do use this, make it a save-time option and disable it by default
if possible.)
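
As a rough illustration of the advice above, here is a minimal Python sketch (not code from this thread) that tolerates a BOM when reading and writes one only when explicitly asked. The function names and the write_bom parameter are hypothetical, chosen just for this example.

```python
import codecs

# codecs.BOM_UTF8 is the three bytes 0xEF 0xBB 0xBF, i.e. U+FEFF in UTF-8.

def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM, if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data

def encode_text(text: str, write_bom: bool = False) -> bytes:
    """Encode text as UTF-8; the BOM is opt-in and off by default."""
    payload = text.encode("utf-8")
    return codecs.BOM_UTF8 + payload if write_bom else payload
```

Note that finding these bytes at the start of a file is only a hint that it is UTF-8, and their absence says nothing either way.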
> The other is the occurrence of this text
> "charset=utf-8", anywhere in the file (at least according to the editor
> I use: UltraEdit).
What if a Japanese writer is explaining, in a Shift-JIS document, how to use this
feature? "charset=utf-8" can legitimately appear in text files of any
encoding. This email is not UTF-8, but it contains that string. :)
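
To make that concrete, here is a small Python sketch, assuming a marker check along the lines Brian describes (the real UltraEdit logic is not known here, and has_charset_marker is a made-up name). It builds a Shift-JIS byte string that contains the marker and is still not valid UTF-8.

```python
def has_charset_marker(data: bytes) -> bool:
    """Naive check: does the ASCII text 'charset=utf-8' occur anywhere?"""
    return b"charset=utf-8" in data.lower()

def decodes_as_utf8(data: bytes) -> bool:
    """Strictly decode as UTF-8 and report whether that succeeded."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# A Shift-JIS document that merely mentions the marker.
sjis_doc = "charset=utf-8 日本語".encode("shift_jis")

marker_found = has_charset_marker(sjis_doc)   # the marker is present
really_utf8 = decodes_as_utf8(sjis_doc)       # but the bytes are not UTF-8
```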
There is no portable way to tell for sure whether a file is UTF-8. If
you don't know the encoding of a file, you can only guess, but every
guessing mechanism can guess wrong.
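
One common guess is "it decodes cleanly as UTF-8, so it probably is UTF-8". The Python sketch below (could_be_utf8 is a hypothetical helper, not a standard function) shows both that this rules out some files and that it can still be fooled.

```python
def could_be_utf8(data: bytes) -> bool:
    """Guess: True if data decodes as strict UTF-8. Not proof of anything."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# A lone 0xE9 (Latin-1 "é") is not valid UTF-8, so this is rejected.
latin1_e_acute = "é".encode("latin-1")

# But the Latin-1 text "Ã©" is the bytes 0xC3 0xA9, which is also
# exactly the UTF-8 encoding of "é". The guess accepts it.
latin1_mojibake = "Ã©".encode("latin-1")

rejected = could_be_utf8(latin1_e_acute)
fooled = could_be_utf8(latin1_mojibake)
```

Pure ASCII passes the same check too, and it is equally valid in most other 8-bit encodings, so a True result only means UTF-8 has not been ruled out.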
--
Glenn Maynard