- Subject: Re: Managing Unicode (UTF-8 and UTF-16) data in Lua
- From: Paul Moore <p.f.moore@...>
- Date: Sat, 6 Aug 2016 17:54:49 +0100
On 6 August 2016 at 16:49, Egor Skriptunoff <egor.skriptunoff@gmail.com> wrote:
>
>> I'm trying to embed Lua in a Windows program that needs to be "Unicode
>> clean".
>
> ...
>>
>> 1. The print() function doesn't handle UTF-8
>
>
> Why do you want to use UTF-8 encoding on Windows?
So that I can display all Unicode characters.
> IMO, there is absolutely no reason to use UTF-8 on Windows-only
> applications.
No reason to use UTF-8, certainly, but definitely a reason to use a
full-Unicode encoding (as opposed to incomplete encodings like cp850
or cp1252).
> BTW, I don't know if it is possible to display arbitrary Unicode symbol in
> Windows console,
It is. But you have to use the Unicode APIs, not the C stdlib.
> so maybe the wish of making Unicode version of print() is unattainable
> anyway.
It's perfectly possible, but not with ANSI C only (which is what core
Lua requires).
> All you really need is UTF-16LE strings (which are used by W-functions of
> WinAPI).
> But there are some problems to solve:
That's what I'm trying to do - solve these problems.
> Problem #1: most of string library functions work incorrectly with UTF-16.
> Solution:
> You should write your own implementation of UTF-16 string library functions:
> string16.sub(), string16.gmatch(), string16.upper() and so on if you need
> them.
Why do that when the standard Lua string type is UTF-8 safe? Better
surely to use UTF-8 via Lua strings, and only use UTF-16 for
interfacing to the Windows APIs?
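
To make that split concrete (UTF-8 inside Lua, UTF-16 only at the API
boundary), here is a minimal sketch of the conversion in pure Lua. It
assumes Lua 5.3 or later (for utf8.codes, string.pack and the integer
bit operators), and the name utf8_to_utf16le is made up for this
example. In a real embedding the conversion would more likely live on
the C side, for instance via MultiByteToWideChar.

```lua
-- Sketch only: assumes Lua 5.3+; utf8_to_utf16le is a hypothetical name.
-- Converts a UTF-8 encoded Lua string to a UTF-16LE byte string.
function utf8_to_utf16le(s)
  local out = {}
  -- utf8.codes raises an error if s is not valid UTF-8
  for _, cp in utf8.codes(s) do
    if cp >= 0x10000 then
      -- outside the BMP: encode as a surrogate pair
      cp = cp - 0x10000
      out[#out + 1] = string.pack("<I2I2",
        0xD800 + (cp >> 10),    -- high surrogate
        0xDC00 + (cp & 0x3FF))  -- low surrogate
    else
      out[#out + 1] = string.pack("<I2", cp)
    end
  end
  return table.concat(out)
end
```

With this, utf8_to_utf16le("\u{20AC}") gives "\xAC\x20", the same bytes
as the Euro example quoted below. Note that the result has no
terminating NUL code unit, which most W-functions would need.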
> Problem #2: io.popen() generates output in cp850.
> Solution:
> You can use Lua standard function io.popen() to get UTF-16 output.
> Windows does support Unicode output for all internal commands by prefixing
> commands with "cmd /u/d/c"
> For example, the following code gets Unicode filenames as one UTF-16 string:
>
> local cmd1 = [[echo List#1]]
> local cmd2 = [[dir /b]]
> local cmd3 = [[echo List#2]]
> local cmd4 = [[dir "C:\Program Files" /b]]
> local cmd = [["cmd /u/d/c "]]..cmd1.."&"..cmd2.."&"..cmd3.."&"..cmd4..[[""]]
> -- cmd here is Lua string in 1-byte encoding (win1252)
> local output16 = io.popen(cmd, "rb"):read"*a"
> -- output16 here is Lua string containing UTF-16 symbols
> -- for example, the "Euro" symbol (U+20AC) in output16 will be written
> -- as "\xAC\x20".
> -- lines in output16 are separated by "\r\0\n\0"
That sounds pretty complicated, and wouldn't work for a directory with
a euro sign in the name. Rewriting popen (using Windows API calls) to
take a UTF-8 command line and return bytes isn't easy, but it's not
impossible.
But I'm not trying to make the whole of Lua Unicode-safe here, just
work cleanly with the APIs I need.
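
For the opposite direction, a UTF-16LE byte string such as the output16
value above could be turned back into an ordinary UTF-8 Lua string along
the following lines. Again this is only a sketch, assuming Lua 5.3 or
later (string.unpack, utf8.char), and utf16le_to_utf8 is a hypothetical
name rather than anything that exists in Lua or in the code quoted
above.

```lua
-- Sketch only: assumes Lua 5.3+; utf16le_to_utf8 is a hypothetical name.
-- Converts a UTF-16LE byte string to a UTF-8 encoded Lua string.
function utf16le_to_utf8(s)
  local out = {}
  local i = 1
  while i + 1 <= #s do
    local unit
    -- string.unpack returns the value and the next read position
    unit, i = string.unpack("<I2", s, i)
    if unit >= 0xD800 and unit <= 0xDBFF and i + 1 <= #s then
      -- high surrogate: combine with a following low surrogate, if any
      local low = string.unpack("<I2", s, i)
      if low >= 0xDC00 and low <= 0xDFFF then
        unit = 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00)
        i = i + 2
      end
    end
    -- unpaired surrogates are passed through unchanged in this sketch
    out[#out + 1] = utf8.char(unit)
  end
  return table.concat(out)
end
```

After conversion the "\r\0\n\0" separators become plain "\r\n", so the
result can be split into lines with the normal string functions. A
trailing odd byte, if present, is silently ignored here.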
> Problem #3: os.execute() and io.popen() accept command line in win1252
> Solution:
> Write your own implementation that accepts UTF-16 command line.
So that was my question - can I do that and replace the existing
os.execute, or do I have to write my own under different names, and
tell people "don't use os.execute, use this instead"?
> Problem #4: os.getenv() generates output in win1252.
> Solution:
>
> function getenv16 (env_var_name)
>    -- this is UTF-16 analogue of os.getenv()
>    -- env_var_name must be in 1-byte encoding (win1252)
>    -- returns UTF-16 string (or nil if the variable is not defined)
>    local cmd = [["cmd /u/d/c "if defined ]]..env_var_name..[[ echo %]]..env_var_name..[[%""]]
>    local pipe = io.popen(cmd, "rb")
>    local result = pipe:read"*a"
>    pipe:close()
>    return result ~= "" and result:gsub("\r%z\n%z$", "") or nil
> end
Again, surely it's easier, and performs better (no subprocess), to
write a wrapper around GetEnvironmentVariableW that returns a Lua
string encoded as UTF-8?
Paul