- Subject: Re: Lost in Unicode
- From: Enrico Colombini <erix@...>
- Date: Mon, 20 Oct 2003 15:29:49 +0200
On Monday 20 October 2003 14:20, Roberto Ierusalimschy wrote:
> If the system may use two different representations, the simplest
> solution is to translate to a fixed representation as soon as you read
> something. If you can assume that all relevant utf-8 text can be mapped
> to ISO-8859-1, it is better to use ISO-8859-1 internally. It is easy
> to write a function to translate utf-8 to ISO-8859-1:
>
>   function toISO (s)
>     if string.find(s, "[\224-\255]") then error("non-ISO char") end
>     s = string.gsub(s, "([\192-\223])(.)", function (c1, c2)
>       c1 = string.byte(c1) - 192
>       c2 = string.byte(c2) - 128
>       return string.char(c1 * 64 + c2)
>     end)
>     return s
>   end
Thanks, this seems in fact to be the easiest way; I like the idea of doing it
all in Lua without resorting to C or, worse, having to dig for obscure (and
possibly non-portable) system calls. A small extra complication will be
user-supplied text files, but I could just add a line at the beginning of the
file to specify its format (just like email messages or Web pages).
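
The header-line idea could be sketched roughly as follows. Everything here is an assumption made for illustration: the marker syntax `#encoding: <name>` on the first line, the names `decode_text` and `utf8_to_iso`, and the choice to treat a file without a marker as ISO-8859-1. The conversion is a slightly stricter variant of the quoted function (it rejects lead bytes above 195, which encode code points past U+00FF), and it assumes Lua 5.1 or later for `string.match`.

```lua
-- Convert UTF-8 to ISO-8859-1; raises an error for code points above U+00FF.
-- Note: this does not fully validate the input (stray continuation bytes pass through).
function utf8_to_iso(s)
  -- Lead bytes 196..255 start sequences that cannot fit in ISO-8859-1.
  if string.find(s, "[\196-\255]") then error("non-ISO char") end
  s = string.gsub(s, "([\194-\195])([\128-\191])", function (c1, c2)
    return string.char((string.byte(c1) - 192) * 64 + (string.byte(c2) - 128))
  end)
  return s
end

-- Decode a whole text whose first line may be a (hypothetical) encoding marker.
function decode_text(text)
  -- Split off the first line; the remainder is the body.
  local first, rest = string.match(text, "^([^\n]*)\n?(.*)$")
  local enc = string.match(first, "^#encoding:%s*([%w%-]+)%s*$")
  if not enc then
    return text  -- no marker: assume ISO-8859-1 and keep every line
  end
  enc = string.lower(enc)
  if enc == "utf-8" then
    return utf8_to_iso(rest)
  elseif enc == "iso-8859-1" then
    return rest
  end
  error("unknown encoding: " .. enc)
end
```

With this sketch, `decode_text("#encoding: utf-8\ncaf\195\169")` yields the ISO-8859-1 string `"caf\233"`, while a file without the marker line is returned unchanged.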
It's a pity there's no way to distinguish between the two types of text files
by looking at their contents (apart maybe from statistical analysis...).
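
One content-based heuristic that is sometimes used, sketched here under stated assumptions: accented ISO-8859-1 text rarely happens to form well-formed UTF-8 byte sequences, so a text that contains high bytes and passes a structural UTF-8 check is probably UTF-8. This is a guess, not a proof. Pure ASCII is identical in both encodings, and a few ISO-8859-1 strings (for example "Ã©") are also valid UTF-8. The function name `looks_like_utf8` is hypothetical, and the check is only structural (it does not reject overlong forms or surrogates).

```lua
-- Heuristic: true if s contains high bytes and all of them belong to
-- structurally valid UTF-8 sequences; false otherwise (including pure ASCII).
function looks_like_utf8(s)
  if not string.find(s, "[\128-\255]") then
    return false  -- pure ASCII: nothing to decide
  end
  -- Replace each lead byte plus its continuation bytes with a space, so that
  -- leftover bytes cannot join up into a new, accidental sequence.
  s = string.gsub(s, "[\194-\223][\128-\191]", " ")
  s = string.gsub(s, "[\224-\239][\128-\191][\128-\191]", " ")
  s = string.gsub(s, "[\240-\244][\128-\191][\128-\191][\128-\191]", " ")
  -- Any high byte still present was not part of a valid sequence.
  return not string.find(s, "[\128-\255]")
end
```

For instance, `looks_like_utf8("caf\195\169")` is true, while the ISO-8859-1 form `looks_like_utf8("caf\233")` is false, because the lone byte 233 has no continuation bytes after it.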
Enrico