- Subject: Re: question about Unicode
- From: Mike Pall <mikelu-0612@...>
- Date: Thu, 7 Dec 2006 22:40:28 +0100
Hi,
David Given wrote:
> What I'd rather see, though, is a clear
> statement that *all* high-bit bytes are treated as valid in identifiers, and a
> removal of the locale-specific behaviour for low-bit characters in favour of
> fixed (and documented) tables.
I second this. Locale-dependent lexing is bad. The above rule is
both simple and effective. Please let's avoid the Java mess.
I.e. an identifier matches: /[A-Za-z_\x80-\xff][0-9A-Za-z_\x80-\xff]*/
Replacing isdigit, isalnum, isalpha, isspace, iscntrl in llex.c
should suffice. Overhead: a 257-byte (*) read-only table holding the
bitmasks. You can speed it up if you fold in the checks for '_'
and '.'. It should be faster anyway, because the NLS-aware libc
ctype macros need a function call which isn't always optimized
away by the compiler.
(*) EOF (-1) must be handled, too.
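For the record, a rough sketch of what such a table-driven check
could look like (this is not an actual patch against llex.c; the
names luai_ctype, MASK_IDSTART and MASK_IDCONT and the bit layout
are made up for illustration). The table has 257 entries so that
EOF (-1) can be looked up at index 0, with real bytes at 1..256:

  #include <stdio.h>

  #define MASK_IDSTART 0x01  /* [A-Za-z_\x80-\xff]    */
  #define MASK_IDCONT  0x02  /* [0-9A-Za-z_\x80-\xff] */

  static unsigned char luai_ctype[257];

  static void init_ctype(void)
  {
    int c;
    for (c = 0; c < 256; c++) {
      unsigned char m = 0;
      if ((c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
          c == '_' || c >= 0x80)
        m |= MASK_IDSTART | MASK_IDCONT;
      else if (c >= '0' && c <= '9')
        m |= MASK_IDCONT;
      luai_ctype[c + 1] = m;  /* shift by one to leave room for EOF */
    }
    luai_ctype[0] = 0;        /* EOF (-1) matches nothing */
  }

  /* c is an int as returned by the lexer's next-char function:
     a byte value 0..255 or EOF (-1). */
  #define isidstart(c)  (luai_ctype[(c) + 1] & MASK_IDSTART)
  #define isidcont(c)   (luai_ctype[(c) + 1] & MASK_IDCONT)

  int main(void)
  {
    init_ctype();
    printf("%d %d %d\n",
           !!isidstart('a'), !!isidcont('7'), !!isidstart(0xE9));
    return 0;
  }

A real patch would of course build the table at compile time rather
than at runtime, and fold in whatever other character classes llex.c
needs (spaces, digits, control characters and so on).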
Bye,
Mike
- References:
- Re: question about Unicode, Matt Campbell
- Re: question about Unicode, Roberto Ierusalimschy
- Re: question about Unicode, David Jones
- Re: question about Unicode, Roberto Ierusalimschy
- Re: question about Unicode, David Given
- Re: question about Unicode, Rici Lake
- Re: question about Unicode, Roberto Ierusalimschy
- Re: Re: question about Unicode, Ken Smith
- Re: question about Unicode, Adrian Perez
- Re: question about Unicode, David Given