On 7-Dec-06, at 4:40 PM, Mike Pall wrote:
> Hi,
>
> David Given wrote:
> > What I'd rather see, though, is a clear statement that *all* high-bit
> > bytes are treated as valid in identifiers, and a removal of the
> > locale-specific behaviour for low-bit characters in favour of fixed
> > (and documented) tables.
>
> I second this. Locale-dependent lexing is bad. The above rule is both
> simple and effective. Please let's avoid the Java mess.
>
> I.e. an identifier matches: /[A-Za-z_\x80-\xff][0-9A-Za-z_\x80-\xff]*/
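For concreteness, that rule amounts to something like the following (a hypothetical C sketch with made-up names, not the actual llex.c code):

    /* Fixed-table identifier classification: ASCII letters, '_',
       and every byte with the high bit set.  No locale, no ctype. */
    static int is_ident_start(unsigned char c) {
      return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
          || c == '_' || c >= 0x80;
    }

    static int is_ident_cont(unsigned char c) {
      return is_ident_start(c) || (c >= '0' && c <= '9');
    }

    /* Scan one identifier starting at p; returns one past its end. */
    static const char *scan_ident(const char *p) {
      if (!is_ident_start((unsigned char)*p)) return p;
      do { p++; } while (is_ident_cont((unsigned char)*p));
      return p;
    }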
I'm not convinced by this. It seems to me an invitation to obscure bugs. If I actually use the identifier código (say) in one file and try to refer to it from another file, the reference might fail because the encodings are different. For example, one file might be in iso-8859-1 and the other in utf-8, or both of them might be in utf-8 but one uses a composed ó while the other uses an o followed by a combining accent. These differences may be completely invisible.
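To make the failure concrete, here are the raw bytes a byte-oriented lexer would see for each spelling of the same visible name (a small C illustration; the byte values follow from the encodings, the program itself is made up):

    #include <stdio.h>
    #include <string.h>

    int main(void) {
      /* "código" as raw bytes in three spellings: */
      const char latin1[]   = "c\xF3" "digo";        /* iso-8859-1, 6 bytes */
      const char utf8_nfc[] = "c\xC3\xB3" "digo";    /* utf-8, precomposed U+00F3 */
      const char utf8_nfd[] = "co\xCC\x81" "digo";   /* utf-8, 'o' + U+0301 combining acute */

      /* A byte-oriented lexer compares names bytewise, so all three
         are distinct identifiers even though they render identically. */
      printf("%d %d\n", strcmp(latin1, utf8_nfc) != 0,
                        strcmp(utf8_nfc, utf8_nfd) != 0);  /* prints: 1 1 */
      return 0;
    }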
I strongly agree that "locale-dependent lexing is bad"; however, robust lexing needs to be aware of Unicode normalization forms. Unfortunately, that is by no means cheap.
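As a rough sketch of what normalization-awareness entails, identifier handling would need something like this on top of ICU4C (an assumed external dependency; Lua itself provides nothing of the sort):

    #include <unicode/unorm2.h>

    /* Normalize a UTF-16 identifier to NFC before comparing or
       interning it; returns the normalized length, or -1 on error.
       Composed and decomposed spellings then compare equal. */
    static int32_t normalize_ident(const UChar *src, int32_t srclen,
                                   UChar *dst, int32_t dstcap) {
      UErrorCode err = U_ZERO_ERROR;
      const UNormalizer2 *nfc = unorm2_getNFCInstance(&err);
      if (U_FAILURE(err)) return -1;
      int32_t n = unorm2_normalize(nfc, src, srclen, dst, dstcap, &err);
      return U_FAILURE(err) ? -1 : n;
    }

Even this minimal version drags in ICU's normalization data tables, which alone are far larger than the whole Lua core; that is the "by no means cheap" part.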