- Subject: Re: Managing Unicode (UTF-8 and UTF-16) data in Lua
- From: Coda Highland <chighland@...>
- Date: Sun, 7 Aug 2016 13:21:04 -0700
On Sun, Aug 7, 2016 at 7:59 AM, Egor Skriptunoff
<egor.skriptunoff@gmail.com> wrote:
>
>> > Operations on fixed width character strings (such as UTF-16) are
>> > processed faster.
>>
>> UTF-16 isn't fixed char width.
>
>
> Yes, you are absolutely correct.
> UTF-16 uses surrogate pairs to represent codepoints above 0x10000.
> But Windows does not support them.
> When you are writing a surrogate-pair-symbol to Windows console
> (I've tested this on Win7 with a simple program using WriteConsoleW),
> it gets displayed as two question marks,
> that is, Windows considers it as two separate symbols instead of just one.
>
> If Windows does not support surrogate pairs, why should we?
> That's why we can treat UTF-16 on Windows as fixed-char-width encoding.
>
> Of course, this means that 100% correct Unicode "print()" function is
> non-implementable for Windows console applications.
>
Windows DOES "support" surrogates -- it moved from UCS-2 (equivalent
to UTF-16 constrained to the BMP) to UTF-16 a long time ago (Windows
2000, I think). It supports them in the sense that it can render them
correctly and won't corrupt them if they exist. The support is
roughly equivalent to Lua's UTF-8 support: if you know what you're
doing and you explicitly ask for it, then it can deal with them, but
if you just use the naive wide-string functions, each surrogate pair
gets treated as two separate characters.
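
To make the "naive versus explicit" distinction concrete, here is a
minimal sketch, assuming Lua 5.3 or later (for the utf8 library and the
integer bit operators). The name utf16_units is a hypothetical helper
written for this illustration, not part of Lua or of any Windows API; it
only shows how a code point outside the BMP becomes a surrogate pair.

```lua
-- Assumes Lua 5.3+ (utf8 library, >> and & operators).

-- U+1F600 lies outside the BMP.
local s = utf8.char(0x1F600)

-- Naive view: '#' counts bytes, so one character looks like four.
local byte_len = #s

-- Explicit view: utf8.len counts code points.
local cp_len = utf8.len(s)

-- Hypothetical helper: split one code point into UTF-16 code units.
-- Does not validate its input (e.g. lone surrogates pass through).
local function utf16_units(cp)
  if cp < 0x10000 then
    return { cp }                 -- BMP: a single 16-bit unit
  end
  local v = cp - 0x10000          -- 20 bits left to encode
  return {
    0xD800 + (v >> 10),           -- high (lead) surrogate
    0xDC00 + (v & 0x3FF),         -- low (trail) surrogate
  }
end

local units = utf16_units(0x1F600)
print(byte_len, cp_len, #units)
print(string.format("%04X %04X", units[1], units[2]))
```

For U+1F600 the first line prints 4, 1 and 2, and the second prints
D83D DE00. A function that counts 16-bit units sees two "characters"
there, which is the same kind of mismatch as '#' versus utf8.len on the
Lua side.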
/s/ Adam