Lua Unicode
In short, yes and no. Lua is Unicode-agnostic and Lua strings are counted, so whenever you can treat Unicode strings as simple byte sequences, you are done. Whenever that does not suffice, extension modules can supply what you need. You just have to figure out exactly what you mean by "support Unicode" and use the proper abstraction from the right module. Unicode is extremely complex.
Some of the issues are:
A Lua string is an arbitrary counted sequence of bytes (C chars of your compiler, so 8 bits or bigger).
Lua does not reserve any value, including NUL, so arbitrary binary data, including Unicode data, can be stored.
For best results, use an encoding with Unicode code units no bigger than a single byte, which normally restricts you to UTF-8.
Any other encoding can be stored as well, including but not limited to UTF-16, UTF-32 and their various big-endian/little-endian variants.
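A minimal sketch of what "counted sequence of bytes" means in practice. The variable names are only illustrative, and the byte values are the standard encodings of U+20AC.

```lua
-- An embedded NUL does not terminate a Lua string.
local with_nul = "a\0b"
assert(#with_nul == 3)

-- U+20AC EURO SIGN stored as UTF-8 (bytes E2 82 AC)
-- and as UTF-16LE (bytes AC 20). Lua just sees bytes.
local euro_utf8 = "\226\130\172"
local euro_utf16le = "\172\32"
assert(#euro_utf8 == 3)
assert(#euro_utf16le == 2)
assert(euro_utf8:byte(1) == 0xE2)
```

The length operator reports bytes in both cases; which encoding the bytes represent is something only your program knows.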
Input and output of strings in Lua (using the io library) conforms to C's guarantees.
ANSI C only requires the stdio library to handle arbitrary data in binary mode.
In text mode, the runtime is allowed to translate bytes on input and output in order to handle line-end conventions or even charsets foreign to C.
Woe unto you if the library expects you to use an internal encoding incompatible with your actual encoding, and it tries to adapt line-end conventions or even changes the encoding.
This may affect your ability to do non-binary file input and output of Unicode.
UTF-8 is probably safe, because it is ASCII-compatible and never uses ASCII characters as part of a multibyte encoding for a codepoint.
All modern systems only do minimal byte-sequence mapping for line endings in text mode, Unix going so far as not needing even that and so making text and binary mode identical.
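If you want to rule out text-mode translation entirely, one option is to open files in binary mode. The following is a sketch only: it round-trips a UTF-8 string through a temporary file, and the file name comes from os.tmpname, whose behaviour is platform-dependent.

```lua
-- Round-trip a UTF-8 string through a file opened in binary mode,
-- so the C runtime performs no line-end or charset translation.
local path = os.tmpname()            -- temporary file name (platform-dependent)
local text = "na\195\175ve\r\n"      -- "naïve" in UTF-8, followed by CR LF

local out = assert(io.open(path, "wb"))
out:write(text)
out:close()

local inp = assert(io.open(path, "rb"))
local read_back = inp:read("*a")     -- read the whole file
inp:close()
os.remove(path)

assert(read_back == text)
```

On Unix the "b" flag makes no difference; on systems that translate line endings in text mode it keeps the CR LF pair intact.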
If your use of Unicode is restricted to passing the strings to external libraries that support Unicode, you should be OK. For example, you should be able to extract a Unicode string from a database and pass it to a Unicode-aware graphics library.
Literal Unicode strings can appear in your Lua programs. Either a UTF-8 encoded string can appear directly with 8-bit characters or you can use the \ddd syntax (note that ddd is a decimal number, unlike some other languages). Lua 5.3 and later also accept \u{XXX}, which inserts the UTF-8 encoding of the given hexadecimal codepoint. In earlier versions there is no facility for encoding multi-octet sequences (such as \U+20B4); you would need to either manually encode them to UTF-8, or insert individual octets in the correct big-endian/little-endian order (for UTF-16 or UTF-32).
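As an illustration of manual encoding, here is U+20B4 written out octet by octet with decimal escapes, which work in every Lua version. The variable names are made up for this sketch.

```lua
-- U+20B4 HRYVNIA SIGN is the three octets E2 82 B4 in UTF-8.
local hryvnia_utf8 = "\226\130\180"
assert(#hryvnia_utf8 == 3)
assert(hryvnia_utf8:byte(1) == 0xE2)
assert(hryvnia_utf8:byte(2) == 0x82)
assert(hryvnia_utf8:byte(3) == 0xB4)

-- The same codepoint as UTF-16: octets 20 B4, written in either byte order.
local hryvnia_utf16be = "\32\180"
local hryvnia_utf16le = "\180\32"
assert(hryvnia_utf16be:reverse() == hryvnia_utf16le)
```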
Unless you are using an operating system in which a char is more than eight bits wide, you will not be able to use arbitrary Unicode characters in Lua identifiers (for the names of variables and so on). You may be able to use eight-bit characters outside of the ASCII range.
Lua uses the C functions isalpha and isalnum to identify valid characters in identifiers, so it will depend on the current locale.
To be honest, using characters outside of the ASCII range in Lua identifiers is not a good idea, since your programs will not compile in the standard C locale.
Lua string comparison (using the == operator) is done byte-by-byte. That means that == can only be used to compare Unicode strings for equality if both strings have been normalized to the same one of the four Unicode normalization forms. (See the Unicode FAQ on normalization for details.) The standard Lua library does not provide any facility for normalizing Unicode strings. Consequently, non-normalized Unicode strings cannot be reliably used as table keys.
If you want to use the Unicode notion of string equality, or use Unicode strings as table keys, and you cannot guarantee that your strings are normalized, then you'll have to find a normalization function and use that; writing one is a non-trivial exercise!
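To make the problem concrete, the sketch below spells "é" in two canonically equivalent ways and shows that Lua treats them as different strings and different table keys. The variable names are illustrative.

```lua
-- "é" as one precomposed codepoint (U+00E9, NFC form) ...
local precomposed = "\195\169"
-- ... and as "e" followed by U+0301 COMBINING ACUTE ACCENT (NFD form).
local decomposed = "e\204\129"

-- Byte-by-byte comparison sees two different strings.
assert(precomposed ~= decomposed)

-- So they are also two different table keys.
local translations = { [precomposed] = "found" }
assert(translations[precomposed] == "found")
assert(translations[decomposed] == nil)
```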
The Lua comparison operators on strings (< and <=) use the C function strcoll, which is locale dependent. This means that two strings can compare in different ways according to what the current locale is. For example, strings will compare differently under Spanish Traditional sorting than under Welsh sorting.
It may be that your operating system has a locale that implements the sorting algorithm that you want, in which case you can just use that, otherwise you will have to write a function to sort Unicode strings. This is an even more non-trivial exercise.
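If your system does have a suitable locale, something like the following sketch could switch collation temporarily. The helper name with_collation is hypothetical, and the locale name "es_ES.UTF-8" is an assumption: locale names differ between systems, and the locale may not be installed at all.

```lua
-- Run fn with the given collation locale, then restore the previous one.
-- Returns nil plus a message if the locale is not available.
local function with_collation(locale_name, fn)
  local previous = os.setlocale(nil, "collate")   -- query current setting
  if not os.setlocale(locale_name, "collate") then
    return nil, "locale not available: " .. locale_name
  end
  local ok, result = pcall(fn)
  os.setlocale(previous, "collate")
  if not ok then error(result, 0) end
  return result
end

-- ASSUMPTION: "es_ES.UTF-8" is just an example name.
local words = { "llama", "luz", "lobo" }
local sorted, err = with_collation("es_ES.UTF-8", function()
  table.sort(words)   -- table.sort uses <, and therefore strcoll
  return words
end)
```

When the locale is missing, sorted is nil and err explains why, so the caller can fall back to its own comparison function.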
UTF-8 was designed so that a naive byte-by-byte string comparison of an octet sequence would produce the same result as a codepoint-by-codepoint comparison. This is also true of UTF-32BE, but I do not know of any system that uses that encoding. Unfortunately, naive byte-by-byte comparison is not the collation order used by any language.
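The codepoint-order property can be checked with a small sketch. It assumes the "C" locale, in which strcoll compares raw bytes; the sample strings are arbitrary.

```lua
os.setlocale("C", "collate")   -- make < compare raw bytes

local words = {
  "\240\144\128\128",  -- U+10000
  "z",                 -- U+007A
  "\226\130\172",      -- U+20AC
  "\195\169",          -- U+00E9
}
table.sort(words)

-- Byte order equals codepoint order: U+007A, U+00E9, U+20AC, U+10000.
assert(words[1] == "z")
assert(words[2] == "\195\169")
assert(words[3] == "\226\130\172")
assert(words[4] == "\240\144\128\128")
```

This gives a stable, well-defined order, but not one that a human reader of any particular language would expect.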
(Note: sometimes people use the terms UCS-2 and UCS-4 for "two-byte" and "four-byte" encodings. These are not Unicode standards; they come from the closely corresponding ISO standard ISO/IEC 10646-1:2000 and currently differ in that they allow codes outside of the Unicode range, which runs from 0x0 to 0x10FFFF.)
Lua's pattern matching facilities work byte by byte. In general, this will not work for Unicode pattern matching, although some things will work as you want. For example, "%u" will not match all Unicode upper case letters. You can match individual Unicode characters in a normalized Unicode string, but you might want to worry about combining character sequences. If there are no following combining characters, "a" will match only the letter a in a UTF-8 string. In UTF-16LE you could match "a\0".
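A short sketch of what does and does not work with byte-oriented patterns. The sample strings are illustrative, and the one_char pattern assumes valid UTF-8 input.

```lua
local cafe = "caf\195\169"           -- "café" in UTF-8 (precomposed é)

-- Plain ASCII letters can be matched safely.
assert(cafe:find("a", 1, true) == 2)

-- "." consumes one byte, so it splits the two-byte "é".
assert(cafe:match("caf(.)") == "\195")

-- A pattern for one whole UTF-8 sequence (valid input assumed).
local one_char = "[%z\1-\127\194-\244][\128-\191]*"
assert(cafe:match("caf(" .. one_char .. ")") == "\195\169")

-- In UTF-16LE the letter "a" is the two bytes 61 00.
local caf_utf16le = "c\0a\0f\0"
assert(caf_utf16le:find("a\0", 1, true) == 3)
```

Note that a plain byte search in UTF-16 can also hit a match that straddles two code units, so real code would have to check that the match position is aligned.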
If you use Unicode strings, there are at least five different notions of length. Beware of using the wrong one.
Length in bytes: string.len offers that.
Length in codeunits: divide string.len by bytes per codeunit according to your encoding.
Length in codepoints: utf8.len counts them for UTF-8. There is no function for UTF-16 (LE or BE), though you can easily construct one using string.gsub or string.gmatch. For UTF-32 (LE or BE), this equals the number of codeunits.
For example, you could use the following code snippet to count UTF-8 characters in a string you knew to be conforming (it will incorrectly count some invalid characters):
local _, count = string.gsub(unicode_string, "[^\128-\193]", "")
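The different lengths can be compared side by side. In this sketch, utf16le_len is a hypothetical helper that counts codepoints in a well-formed UTF-16LE string by skipping low surrogates; the sample strings are arbitrary.

```lua
-- "e" + U+0301 (combining acute) + U+20AC (euro sign), as UTF-8.
local s = "e\204\129\226\130\172"

local byte_len = #s                                        -- 1 + 2 + 3 bytes
local _, codepoint_len = string.gsub(s, "[^\128-\193]", "")
assert(byte_len == 6)
assert(codepoint_len == 3)

-- Lua 5.3 and later: the built-in utf8 library agrees.
if utf8 then assert(utf8.len(s) == codepoint_len) end

-- Count codepoints in UTF-16LE: every code unit except low surrogates
-- (DC00..DFFF). The high byte is the second byte of each unit.
local function utf16le_len(str)
  local count = 0
  for i = 2, #str, 2 do
    local high_byte = str:byte(i)
    if high_byte < 0xDC or high_byte > 0xDF then
      count = count + 1
    end
  end
  return count
end

-- U+20AC (AC 20) followed by U+1F600 (surrogate pair D83D DE00).
local utf16 = "\172\32\61\216\0\222"
assert(#utf16 == 6)              -- bytes
assert(#utf16 / 2 == 3)          -- code units
assert(utf16le_len(utf16) == 2)  -- codepoints
```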
If you want to know how many printing columns a Unicode string will occupy when you print it out using a fixed-width font (imagine you are writing something like the Unix ls program that formats its output into several columns), then that is a different answer again. That's because some Unicode characters do not have a printing width, while others are double-width characters. Combining characters are used to add accents to other letters, and generally they do not take up any extra space when printed.
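A very rough sketch of a column counter follows. The range tables are an assumption and deliberately incomplete (a real implementation would use the full Unicode East Asian Width and combining-class data), and the function names are hypothetical. It also assumes valid UTF-8 input.

```lua
-- Decode one valid UTF-8 sequence to its codepoint, using arithmetic only.
local function decode(ch)
  local b1 = ch:byte(1)
  local cp = b1 < 0x80 and b1 or b1 < 0xE0 and b1 - 0xC0
          or b1 < 0xF0 and b1 - 0xE0 or b1 - 0xF0
  for i = 2, #ch do
    cp = cp * 64 + (ch:byte(i) - 0x80)
  end
  return cp
end

-- ASSUMPTION: illustrative, incomplete ranges only.
local ZERO_WIDTH = { {0x0300, 0x036F} }   -- combining diacritical marks
local DOUBLE_WIDTH = {
  {0x1100, 0x115F}, {0x2E80, 0xA4CF}, {0xAC00, 0xD7A3}, {0xFF00, 0xFF60},
}

local function in_ranges(cp, ranges)
  for _, r in ipairs(ranges) do
    if cp >= r[1] and cp <= r[2] then return true end
  end
  return false
end

local function display_width(s)
  local width = 0
  for ch in s:gmatch("[%z\1-\127\194-\244][\128-\191]*") do
    local cp = decode(ch)
    if in_ranges(cp, DOUBLE_WIDTH) then
      width = width + 2
    elseif not in_ranges(cp, ZERO_WIDTH) then
      width = width + 1
    end
  end
  return width
end

assert(display_width("abc") == 3)
assert(display_width("e\204\129") == 1)                  -- e + combining acute
assert(display_width("\230\151\165\230\156\172") == 4)   -- U+65E5 U+672C
```

Control characters are counted as one column here, which is another simplification.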
You could use the following code snippet to iterate over UTF-8 sequences (this will simply skip over most invalid codes):
for uchar in string.gmatch(ustring, "([%z\1-\127\194-\244][\128-\191]*)") do
  -- something
end
--[[
| bits | U+first   | U+last     | bytes | Byte_1   | Byte_2   | Byte_3   | Byte_4   | Byte_5   | Byte_6   |
+------+-----------+------------+-------+----------+----------+----------+----------+----------+----------+
|   7  | U+0000    | U+007F     |   1   | 0xxxxxxx |          |          |          |          |          |
|  11  | U+0080    | U+07FF     |   2   | 110xxxxx | 10xxxxxx |          |          |          |          |
|  16  | U+0800    | U+FFFF     |   3   | 1110xxxx | 10xxxxxx | 10xxxxxx |          |          |          |
|  21  | U+10000   | U+1FFFFF   |   4   | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx |          |          |
| *26  | U+200000  | U+3FFFFFF  |   5   | 111110xx | 10xxxxxx | 10xxxxxx | 10xxxxxx | 10xxxxxx |          |
| *31  | U+4000000 | U+7FFFFFFF |   6   | 1111110x | 10xxxxxx | 10xxxxxx | 10xxxxxx | 10xxxxxx | 10xxxxxx |
--]]
* UTF-8 was restricted to 4 bytes, because UTF-16 surrogates enable a maximum of BMP + 16 astral planes, which is not quite 21 bits.
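The iteration pattern can be wrapped in a small helper. This is only a sketch: utf8_chars is a hypothetical name, and like the loop above it silently skips most invalid bytes.

```lua
-- Split a UTF-8 string into a table of single-character strings.
local function utf8_chars(s)
  local chars = {}
  for uchar in string.gmatch(s, "([%z\1-\127\194-\244][\128-\191]*)") do
    chars[#chars + 1] = uchar
  end
  return chars
end

local chars = utf8_chars("a\226\130\172b")   -- "a", U+20AC, "b"
assert(#chars == 3)
assert(chars[2] == "\226\130\172")
```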
This function converts a Lua string that contains UTF-8 encoded characters into a Lua table with its corresponding Unicode codepoints (UTF-32), followed by a terminating 0. It uses the bit32 library of Lua 5.2.
function Utf8to32(utf8str)
  assert(type(utf8str) == "string")
  local res, seq, val = {}, 0, nil
  for i = 1, #utf8str do
    local c = string.byte(utf8str, i)
    if seq == 0 then
      -- start of a new sequence: store the previous codepoint, if any
      table.insert(res, val)
      -- number of bytes in this sequence, from the lead byte
      seq = c < 0x80 and 1 or c < 0xE0 and 2 or c < 0xF0 and 3 or
            c < 0xF8 and 4 or --c < 0xFC and 5 or c < 0xFE and 6 or
            error("invalid UTF-8 character sequence")
      -- keep only the payload bits of the lead byte
      val = bit32.band(c, 2^(8-seq) - 1)
    else
      -- continuation byte: append its low six bits
      val = bit32.bor(bit32.lshift(val, 6), bit32.band(c, 0x3F))
    end
    seq = seq - 1
  end
  table.insert(res, val)
  table.insert(res, 0)
  return res
end
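On Lua 5.3 or later, where the built-in utf8 library is available, a similar conversion can be sketched with utf8.codes. The function name utf8_to_codepoints is hypothetical; unlike the function above, it does not append a terminating 0, and utf8.codes raises an error on invalid input.

```lua
-- Requires Lua 5.3 or later (built-in utf8 library).
local function utf8_to_codepoints(s)
  local res = {}
  for _, codepoint in utf8.codes(s) do
    res[#res + 1] = codepoint
  end
  return res
end

local cps = utf8_to_codepoints("a\226\130\172")   -- "a" followed by U+20AC
assert(#cps == 2)
assert(cps[1] == 0x61)
assert(cps[2] == 0x20AC)
```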
As you might have guessed by now, Lua provides no support for things like bidirectional printing or the proper formatting of Thai accents. Normally such things will be taken care of by a graphics or typography library. It would of course be possible to interface to such a library that did these things if you had access to one.
Note: since Lua 5.3, there's a builtin module called "utf8". Some of these modules are called "utf8" too, and will cause a name clash. Please require them under a different name.
One such library provides Unicode-aware versions of the functions in the string library except string.reverse, as well as functions to perform canonical composition and decomposition of Unicode codepoints (NFC and NFD). string.reverse is not implemented because, for it to be useful, it would require parsing of grapheme clusters, which is complicated. A PHP-based version of this library is found in Scribunto, the MediaWiki Lua-scripting extension, which is installed on Wikimedia sites such as Wikipedia and Wiktionary.
See UnicodeIdentifers for platform independent Unicode Lua programs.