- Subject: Re: LPEG documentation needs more clarification
- From: Sean Conner <sean@...>
- Date: Thu, 14 Feb 2019 14:37:02 -0500
It was thus said that the Great pygy79@gmail.com once stated:
> On Wed, Jan 30, 2019 at 4:58 AM Sean Conner <sean@conman.org> wrote:
> >
> >
> > I managed to generate a segfault with LPEG and I can reproduce the issue
> > with this code [1]:
> >
> > local lpeg = require "lpeg"
> > local Cg = lpeg.Cg
> > local Cc = lpeg.Cc
> > local Cb = lpeg.Cb
> > local P = lpeg.P
> >
> > local cnt = Cg(Cc(0),'count')
> > * (P(1) * Cg(Cb'count' / function(c) return c + 1 end,'count'))^0
> > * Cb'count'
> >
> > print(cnt:match(string.rep("x",512+128))) -- CRASH at some point past this line
>
> LuLPeg also crashes, but for a larger string (between 512 * 51 and 512
> * 52 with Lua 5.3).
>
> Just in case, your problem can be solved with a folding capture and no
> temp Lua variable:
>
> local cnt = Cf(
> Cp() * P(1)^0 * Cp(),
> function(first, last) return last - first end
> )
Cool solution, but it won't work for my use case. Sigh.
So here's the actual issue I was trying to solve. I'm dealing with UTF-8
text in a terminal (xterm). I want to trim a line of text so it fits on
one terminal line. utf8.len() won't work because it counts code points and
not the actual number of characters that will be drawn. For example, for
the string
x = "Spin̈al Tap"
string.len(x) returns 12, utf8.len(x) returns 11, but it takes up only 10
character positions on screen (that's a U+0308 "Combining Diaeresis" over
the 'n' character). So there are certain Unicode codepoints I want to skip
counting---I want a "display length", not a "codepoint length". The example
I gave earlier was a bad one for this case.
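To make the three lengths concrete, here is a minimal sketch in plain Lua 5.3 (no LPeg). It is not the approach used in this post: it walks the string with utf8.codes() instead, display_len is a hypothetical helper name, and the two code point ranges it skips are only a small illustrative subset of the zero-width characters.

```lua
-- Sketch: byte length vs. code point length vs. a rough "display length".
-- Needs Lua 5.3+ for the utf8 library and the \u{...} escape.
local x = "Spin\u{0308}al Tap" -- U+0308 is COMBINING DIAERESIS

-- Hypothetical helper: count code points, skipping a few zero-width ranges.
local function display_len(s)
  local n = 0
  for _, cp in utf8.codes(s) do
    local combining  = cp >= 0x0300 and cp <= 0x036F -- combining diacritical marks
    local zero_width = cp >= 0x200B and cp <= 0x200D -- ZWSP, ZWNJ, ZWJ
    if not (combining or zero_width) then
      n = n + 1
    end
  end
  return n
end

-- Bytes, code points, display positions: 12, 11, 10.
print(#x, utf8.len(x), display_len(x))
```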
Anyway, I do have code that works using lpeg.Carg():
local lpeg = require "lpeg"
local P, R, Carg = lpeg.P, lpeg.R, lpeg.Carg

local cutf8 = R" ~" -- ASCII minus C0 control set
            + lpeg.P"\194" * lpeg.R"\160\191" -- UTF-8 minus C1 control set [1]
            + lpeg.R"\195\223" * lpeg.R"\128\191"
            + lpeg.P"\224" * lpeg.R"\160\191" * lpeg.R"\128\191"
            + lpeg.R"\225\236" * lpeg.R"\128\191" * lpeg.R"\128\191"
            + lpeg.P"\237" * lpeg.R"\128\159" * lpeg.R"\128\191"
            + lpeg.R"\238\239" * lpeg.R"\128\191" * lpeg.R"\128\191"
            + lpeg.P"\240" * lpeg.R"\144\191" * lpeg.R"\128\191" * lpeg.R"\128\191"
            + lpeg.R"\241\243" * lpeg.R"\128\191" * lpeg.R"\128\191" * lpeg.R"\128\191"
            + lpeg.P"\244" * lpeg.R"\128\143" * lpeg.R"\128\191" * lpeg.R"\128\191"

local nc = P"\204" * R"\128\191" -- combining chars
         + P"\205" * R"\128\175" -- combining chars
         + P"\225\170" * R"\176\190" -- combining chars
         + P"\225\183" * R"\128\191" -- combining chars
         + P"\226\131" * R"\144\176" -- combining chars
         + P"\239\184" * R"\160\175" -- combining chars
         + P"\u{00AD}" -- shy hyphen
         + P"\u{1806}" -- Mongolian TODO soft hyphen
         + P"\u{200B}" -- zero width space
         + P"\u{200C}" -- zero-width non-joiner
         + P"\u{200D}" -- zero-width joiner

-- Carg(1) is the first extra argument given to match(); it has to be a
-- table with a numeric cnt field.
local cnt = (nc + cutf8 * Carg(1) / function(s) s.cnt = s.cnt + 1 end)^0
          * Carg(1) / function(s) return s.cnt end
It's not 100% perfect [2][3] but for what I'm doing, it works.
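For anyone wanting to try this, below is a rough sketch of how such a pattern might be driven. It assumes LPeg is installed and Lua 5.3+; the char and nc patterns are cut-down stand-ins rather than the full definitions above, and display_len is a hypothetical wrapper name. The point is only that the counter lives in a table passed as the first extra argument to match(), which is what Carg(1) picks up.

```lua
-- Sketch only: simplified stand-ins for the patterns in the post.
local lpeg = require "lpeg"
local P, R, Carg = lpeg.P, lpeg.R, lpeg.Carg

local cont = R"\128\191"                 -- UTF-8 continuation byte
local char = R" ~"                       -- printable ASCII
           + R"\194\223" * cont          -- 2-byte sequences (simplified)
           + R"\224\239" * cont * cont   -- 3-byte sequences (simplified)
local nc   = P"\204" * cont              -- U+0300..U+033F
           + P"\205" * R"\128\175"       -- U+0340..U+036F

-- Same shape as the counting pattern in the post: bump s.cnt for every
-- drawn character, then return the total at the end.
local cnt = (nc + char * Carg(1) / function(s) s.cnt = s.cnt + 1 end)^0
          * Carg(1) / function(s) return s.cnt end

-- Hypothetical wrapper: a fresh state table per call keeps calls independent.
local function display_len(text)
  return cnt:match(text, 1, { cnt = 0 })
end

print(display_len("Spin\u{0308}al Tap"))
```

Passing a new table on every call means the pattern itself holds no state, so it can be reused for any number of strings.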
-spc (I should mention that the strings have had all control codes and
sequences removed, so I do not need to concern myself with that ...)
[1] Definition of UTF-8 I'm using comes from RFC-3629
[2] Doesn't handle RtL text; and there are other 0-width characters I'm
missing.
[3] I could replace the \u{hhh} construction with something that works
for Lua 5.1.