- Subject: Re: Support for Windows unicode paths
- From: Klaus Ripke <paul-lua@...>
- Date: Thu, 23 Jul 2009 18:36:37 +0200
On Thu, Jul 23, 2009 at 03:05:00PM +0100, David Given wrote:
...
> It turns out to be possible to programmatically split a Unicode string
> up into its component grapheme clusters (what I was incorrectly
> referring to as glyphs, and what most people think of as characters).
> So, it ought to be fairly simple to do a Lua addon where you can say:
>
> for c in s:graphemes() do
> print(c)
> end
It is, and actually slnunicode does this, with a few caveats:
"
See http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries
for default grapheme clusters.
Lazy westerners we are (and lacking the Hangul_Syllable_Type data),
we care for base char + Grapheme_Extend, but not for Hangul syllable sequences.
For http://unicode.org/Public/UNIDATA/UCD.html#Grapheme_Extend
we use Mn (NON_SPACING_MARK) + Me (ENCLOSING_MARK),
ignoring the 18 mostly south asian Other_Grapheme_Extend (16 Mc, 2 Cf) from
http://www.unicode.org/Public/UNIDATA/PropList.txt
"
It provides multiple string libs, one of which operates on graphemes,
meaning length, substr etc all count grapheme clusters.
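To make the "base char + Grapheme_Extend" rule concrete, here is a rough
sketch in C of counting clusters in a UTF-8 buffer. This is not the
slnunicode source: the names is_extend, utf8_next, cluster_end and
grapheme_len are made up for illustration, the input is assumed to be
well-formed UTF-8, and is_extend is a tiny stand-in that only knows two
combining blocks instead of the full Mn/Me tables.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a real Mn/Me lookup: it only covers
   U+0300..U+036F and U+20D0..U+20FF, not the full categories. */
static int is_extend(uint32_t c) {
    return (c >= 0x0300 && c <= 0x036F) || (c >= 0x20D0 && c <= 0x20FF);
}

/* Decode one code point and advance *p.
   Assumes well-formed UTF-8; performs no validation. */
static uint32_t utf8_next(const unsigned char **p) {
    const unsigned char *s = *p;
    uint32_t c = *s++;
    int extra = c < 0x80 ? 0 : c < 0xE0 ? 1 : c < 0xF0 ? 2 : 3;
    if (extra) c &= 0x3F >> extra;          /* keep the lead byte's payload bits */
    while (extra-- > 0) c = (c << 6) | (*s++ & 0x3F);
    *p = s;
    return c;
}

/* End of the cluster starting at p: one base code point
   plus every extending mark that directly follows it. */
static const unsigned char *cluster_end(const unsigned char *p,
                                        const unsigned char *end) {
    if (p >= end) return end;
    utf8_next(&p);                          /* the base code point */
    while (p < end) {
        const unsigned char *q = p;
        if (!is_extend(utf8_next(&q))) break;
        p = q;                              /* absorb the mark */
    }
    return p;
}

/* Length of the buffer counted in clusters rather than bytes. */
static size_t grapheme_len(const char *s, size_t nbytes) {
    const unsigned char *p = (const unsigned char *)s;
    const unsigned char *end = p + nbytes;
    size_t n = 0;
    while (p < end) { p = cluster_end(p, end); n++; }
    return n;
}
```

With this, "e" followed by U+0301 (three bytes) counts as one cluster, and
a substr in grapheme mode would just call cluster_end repeatedly to find
its byte offsets.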
> ...where c is a *string* containing a particular grapheme cluster (which
> might be quite long; the link has an example of a four-code point
> cluster). This would actually allow a string to be broken down into an
> array of grapheme clusters to give true random access, which I'd
> previously thought of as being impossible. It'd be expensive, though...
> possibly it'd be worth doing lazily.
It boils down to snippets like:
    if (MODE_GRAPH == mode)
        while (Grapheme_Extend(code) && p>s) code = utf8_oced(&p, s);
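For readers without the source at hand, a self-contained sketch of that
backward step might look like the following. It is an approximation and
not the library code: utf8_prev and cluster_start are hypothetical names,
is_extend is again a two-block stand-in for the real Mn/Me test, and the
buffer is assumed to hold well-formed UTF-8.

```c
#include <stdint.h>

/* Hypothetical stand-in for a real Mn/Me lookup (two blocks only). */
static int is_extend(uint32_t c) {
    return (c >= 0x0300 && c <= 0x036F) || (c >= 0x20D0 && c <= 0x20FF);
}

/* Move *p back to the start of the previous code point and decode it.
   Assumes well-formed UTF-8 and *p > s. */
static uint32_t utf8_prev(const unsigned char **p, const unsigned char *s) {
    const unsigned char *q = *p;
    uint32_t c = 0;
    int shift = 0;
    /* Collect continuation bytes (10xxxxxx), lowest bits first. */
    while (q > s && (q[-1] & 0xC0) == 0x80) {
        c |= (uint32_t)(*--q & 0x3F) << shift;
        shift += 6;
    }
    if (q > s) {
        unsigned char lead = *--q;
        /* ASCII is taken as is; otherwise mask off the length prefix. */
        uint32_t payload = shift == 0
            ? (uint32_t)lead
            : (uint32_t)(lead & (0x3F >> (shift / 6)));
        c |= payload << shift;
    }
    *p = q;
    return c;
}

/* Start of the cluster that ends at p: skip backwards over
   extending marks until a base code point (or s) is reached. */
static const unsigned char *cluster_start(const unsigned char *p,
                                          const unsigned char *s) {
    uint32_t code;
    if (p <= s) return s;
    code = utf8_prev(&p, s);
    while (is_extend(code) && p > s) code = utf8_prev(&p, s);
    return p;
}
```

The loop in cluster_start has the same shape as the snippet above: keep
decoding backwards while the code point just read is an extending mark and
there is still input left.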
Operating on graphemes is not much more expensive than plain UTF-8,
which in turn is no more expensive than UTF-16 done right,
i.e. with a check for the surrogate pairs that encode characters
beyond the BMP.
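To illustrate what "done right" costs on the UTF-16 side, here is a small
sketch of a decoder that handles surrogate pairs. The function name
utf16_next and the choice to return U+FFFD for unpaired surrogates are my
own assumptions for the example, not something taken from any particular
library.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one code point from the UTF-16 code units s[0..n-1],
   starting at index *i and advancing it. Requires *i < n.
   Unpaired surrogates come back as U+FFFD (an assumed policy). */
static uint32_t utf16_next(const uint16_t *s, size_t n, size_t *i) {
    uint32_t u = s[(*i)++];
    if (u >= 0xD800 && u <= 0xDBFF) {               /* high surrogate */
        if (*i < n && s[*i] >= 0xDC00 && s[*i] <= 0xDFFF) {
            uint32_t lo = s[(*i)++];                /* matching low surrogate */
            return 0x10000 + ((u - 0xD800) << 10) + (lo - 0xDC00);
        }
        return 0xFFFD;                              /* high without a low */
    }
    if (u >= 0xDC00 && u <= 0xDFFF) return 0xFFFD;  /* stray low surrogate */
    return u;                                       /* plain BMP code unit */
}
```

So every read needs a range check and possibly a second unit, which is
about the same amount of branching as decoding a UTF-8 sequence.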
enjoy