- Subject: Re: question about Unicode
- From: Mike Pall <mikelu-0612@...>
- Date: Thu, 7 Dec 2006 22:40:28 +0100
Hi,
David Given wrote:
> What I'd rather see, though, is a clear
> statement that *all* high-bit bytes are treated as valid in identifiers, and a
> removal of the locale-specific behaviour for low-bit characters in favour of
> fixed (and documented) tables.
I second this. Locale-dependent lexing is bad. The above rule is
both simple and effective. Please let's avoid the Java mess.
I.e. an identifier matches: /[A-Za-z_\x80-\xff][0-9A-Za-z_\x80-\xff]*/
Replacing isdigit, isalnum, isalpha, isspace, iscntrl in llex.c
should suffice. Overhead: a 257-byte (*) read-only table holding the
bitmasks. You can speed it up if you fold in the checks for '_'
and '.'. It should be faster anyway, because the NLS-aware libc
ctype macros need a function call which isn't always optimized
away by the compiler.
(*) EOF (-1) must be handled, too.
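For the record, a rough sketch of what such a table-driven check
could look like (this is not an actual patch against llex.c; the
names luai_ctype, MASK_IDSTART and MASK_IDCONT and the bit layout
are made up for illustration). The table has 257 entries so that
EOF (-1) can be looked up at index 0, with real bytes at 1..256:

  #include <stdio.h>

  #define MASK_IDSTART 0x01  /* [A-Za-z_\x80-\xff]    */
  #define MASK_IDCONT  0x02  /* [0-9A-Za-z_\x80-\xff] */

  static unsigned char luai_ctype[257];

  static void init_ctype(void)
  {
    int c;
    for (c = 0; c < 256; c++) {
      unsigned char m = 0;
      if ((c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
          c == '_' || c >= 0x80)
        m |= MASK_IDSTART | MASK_IDCONT;
      else if (c >= '0' && c <= '9')
        m |= MASK_IDCONT;
      luai_ctype[c + 1] = m;  /* shift by one to leave room for EOF */
    }
    luai_ctype[0] = 0;        /* EOF (-1) matches nothing */
  }

  /* c is an int as returned by the lexer's next-char function:
     a byte value 0..255 or EOF (-1). */
  #define isidstart(c)  (luai_ctype[(c) + 1] & MASK_IDSTART)
  #define isidcont(c)   (luai_ctype[(c) + 1] & MASK_IDCONT)

  int main(void)
  {
    init_ctype();
    printf("%d %d %d\n",
           !!isidstart('a'), !!isidcont('7'), !!isidstart(0xE9));
    return 0;
  }

A real patch would of course build the table at compile time rather
than at runtime, and fold in whatever other character classes llex.c
needs (spaces, digits, control characters and so on).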
Bye,
Mike
- References:
- Re: question about Unicode, Matt Campbell
- Re: question about Unicode, Roberto Ierusalimschy
- Re: question about Unicode, David Jones
- Re: question about Unicode, Roberto Ierusalimschy
- Re: question about Unicode, David Given
- Re: question about Unicode, Rici Lake
- Re: question about Unicode, Roberto Ierusalimschy
- Re: Re: question about Unicode, Ken Smith
- Re: question about Unicode, Adrian Perez
- Re: question about Unicode, David Given