- Subject: Re: lua for unicode
- From: lua+Steven.Murdoch@...
- Date: Wed, 04 Dec 2002 12:29:36 +0000
> > An important consideration to be made is whether all strings are Unicode
> > or whether a new Unicode type is to be added (as is done in Python).
>
> I think we can live outside these two options. Strings may contain
> Unicode data or not (e.g. they may contain raw binary data, as now).
> If you call a function from the new "utf8" library, it will assume
> the string is a Unicode-utf8 string.
This approach sounds reasonable. If the utf8 library is the only Unicode
string manipulation library then this will effectively be using UTF-8 for the
internal encoding of Unicode strings. This has the advantage of bringing some
backward compatibility characteristics, but probably decreases efficiency.
Most languages I know of use UTF-16 for encoding Unicode strings, but this
choice depends on a number of factors, so it is not necessarily valid for Lua.
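
To illustrate where the efficiency cost of UTF-8 comes from, here is a rough sketch in Python (used only for illustration; the helper names are hypothetical and not part of any proposed Lua API). Because UTF-8 is variable-width, counting characters or locating the n-th character means scanning the bytes, assuming the input is already valid UTF-8:

```python
def utf8_codepoint_count(data: bytes) -> int:
    """Count code points by counting bytes that are not continuation bytes.

    Assumes `data` is already well-formed UTF-8 (hypothetical helper).
    """
    return sum(1 for b in data if (b & 0xC0) != 0x80)


def utf8_offset_of(data: bytes, index: int) -> int:
    """Return the byte offset of the index-th code point (0-based).

    This is a linear scan: there is no way to jump straight to the offset.
    """
    seen = 0
    for pos, b in enumerate(data):
        if (b & 0xC0) != 0x80:  # lead byte or ASCII byte starts a code point
            if seen == index:
                return pos
            seen += 1
    raise IndexError("code point index out of range")


sample = "a\u00e9\u20ac".encode("utf-8")  # 'a', e-acute, euro sign: 1 + 2 + 3 bytes
```

Here `len(sample)` is the byte length, while `utf8_codepoint_count(sample)` is the character count; the two differ as soon as the string contains non-ASCII text.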
> > It is essential that such byte patterns [non-valid Unicode character]
> > do not exist in the internal encoding since this opens several
> > security issues.
>
> I think it would be easier to allow such patterns (among other things
> because strings may contain other stuff besides Unicode data), and to
> check for consistency when needed (that is, inside the functions of the
> "utf8" library).
Yes, that is equally good. The essential requirement is that invalid byte
patterns are never interpreted as valid UTF-8/UTF-16/etc. data. This is normally done
by ensuring that any Unicode strings created are guaranteed to be valid, but
this would not permit binary data to be stored in this datatype. Checking
consistency on read may bring a small performance penalty but this probably
will not be significant.
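
As a minimal sketch of what "checking consistency" inside a utf8 library function could involve, the following Python function (a hypothetical helper, written in Python purely for illustration) rejects the byte patterns that matter for security: stray continuation bytes, truncated sequences, overlong forms such as `C0 AF` for "/", encoded surrogates, and values above U+10FFFF.

```python
def utf8_is_valid(data: bytes) -> bool:
    """Return True if `data` is well-formed UTF-8 (hypothetical helper)."""
    i, n = 0, len(data)
    while i < n:
        b0 = data[i]
        if b0 < 0x80:  # single-byte (ASCII) character
            i += 1
            continue
        if 0xC2 <= b0 <= 0xDF:    # 2-byte sequence (C0/C1 are always overlong)
            need, cp, smallest = 1, b0 & 0x1F, 0x80
        elif 0xE0 <= b0 <= 0xEF:  # 3-byte sequence
            need, cp, smallest = 2, b0 & 0x0F, 0x800
        elif 0xF0 <= b0 <= 0xF4:  # 4-byte sequence
            need, cp, smallest = 3, b0 & 0x07, 0x10000
        else:
            return False  # stray continuation byte or invalid lead byte
        if i + need >= n:
            return False  # sequence is cut off by the end of the string
        for k in range(1, need + 1):
            b = data[i + k]
            if (b & 0xC0) != 0x80:
                return False  # expected a continuation byte
            cp = (cp << 6) | (b & 0x3F)
        if cp < smallest:
            return False  # overlong encoding
        if cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
            return False  # out of range, or a UTF-16 surrogate
        i += need + 1
    return True
```

The check is a single pass over the bytes, which is consistent with the expectation that the cost of validating on read is small. Binary strings are unaffected because they never pass through it unless a utf8 function is called on them.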
However, it would be desirable to check consistency of Unicode data as it is
read from files, since then errors would be caught immediately rather than
later during processing. A Unicode I/O library would be necessary anyway, since
data may have to be read in or written out in an encoding other than UTF-8.
Consistency of UTF-8 strings must also be checked before they are written out
to Unicode files.
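
The I/O behaviour described above might look roughly like the following Python sketch (illustrative only; the function names are hypothetical). Input in some external encoding is validated and converted to UTF-8 as it is read, and UTF-8 data is validated again before being converted and written out:

```python
import io


def read_text_as_utf8(stream, encoding="utf-8"):
    """Read all bytes from a binary stream and return them as UTF-8 bytes.

    Strict decoding raises UnicodeDecodeError on malformed input, so
    errors surface at read time rather than later during processing.
    """
    raw = stream.read()
    text = raw.decode(encoding, errors="strict")
    return text.encode("utf-8")


def write_utf8_as(stream, data, encoding="utf-8"):
    """Validate UTF-8 bytes, then write them to a binary stream in `encoding`.

    Raises UnicodeDecodeError if `data` is not valid UTF-8, and
    UnicodeEncodeError if the target encoding cannot represent the text.
    """
    text = data.decode("utf-8", errors="strict")
    stream.write(text.encode(encoding, errors="strict"))


# Usage: read UTF-16 input, hold it internally as UTF-8, write it as Latin-1.
source = io.BytesIO("na\u00efve".encode("utf-16"))
internal = read_text_as_utf8(source, "utf-16")
sink = io.BytesIO()
write_utf8_as(sink, internal, "latin-1")
```

In this sketch, strings that hold raw binary data simply bypass both functions; only data declared to be Unicode text is checked.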
Steven Murdoch.