- Subject: Re: LPEG documentation needs more clarification
- From: Adrian Perez de Castro <aperez@...>
- Date: Fri, 15 Feb 2019 15:56:43 +0200
Hello!
On Fri, 15 Feb 2019 21:27:19 +0800, Sam Atman <atmanistan@gmail.com> wrote:
> It would seem there is a pure Lua wcwidth already: https://github.com/aperezdc/lua-wcwidth
Author of lua-wcwidth here! Reading this thread I just remembered that the
module needed an update to the latest Unicode version, so I just pushed
version 0.3 earlier today :)
If you end up using it, feel free to ask about it. I mostly lurk
in lua-l nowadays, but I do read most of the threads anyway and I am
happy to help out.
Cheers,
-Adrián
> Sam Atman
> Principal Agent, Special Circumstances
> ~wep
>
> > On Feb 15, 2019, at 10:24 AM, Xavier Wang <weasley.wx@gmail.com> wrote:
> >
> >
> >
> > Sean Conner <sean@conman.org> wrote on Fri, Feb 15, 2019 at 03:37:
> >> It was thus said that the Great pygy79@gmail.com once stated:
> >> > On Wed, Jan 30, 2019 at 4:58 AM Sean Conner <sean@conman.org> wrote:
> >> > >
> >> > >
> >> > > I managed to generate a segfault with LPEG and I can reproduce the issue
> >> > > with this code [1]:
> >> > >
> >> > > local lpeg = require "lpeg"
> >> > > local Cg = lpeg.Cg
> >> > > local Cc = lpeg.Cc
> >> > > local Cb = lpeg.Cb
> >> > > local P = lpeg.P
> >> > >
> >> > > local cnt = Cg(Cc(0),'count')
> >> > > * (P(1) * Cg(Cb'count' / function(c) return c + 1 end,'count'))^0
> >> > > * Cb'count'
> >> > >
> >> > > print(cnt:match(string.rep("x",512+128))) -- CRASH at some point past this line
> >> >
> >> > LuLPeg also crashes, but for a larger string (between 512 * 51 and 512
> >> > * 52 with Lua 5.3).
> >> >
> >> > Just in case, your problem can be solved with a folding capture and no
> >> > temp Lua variable:
> >> >
> >> > local cnt = Cf(
> >> > Cp() * P(1)^0 * Cp(),
> >> > function(first, last) return last - first end
> >> > )
> >>
> >> Cool solution, but it won't work for my use case. Sigh.
> >>
> >> So here's the actual issue I was trying to solve. I'm dealing with UTF-8
> >> text in a terminal (xterm). I want to trim a line of text to fit a line.
> >> utf8.len() won't work because it counts code points and not the actual
> >> number of characters that will be drawn. For example, for the string
> >>
> >> x = "Spin̈al Tap"
> >>
> >> string.len(x) returns 12, utf8.len(x) returns 11, but it takes 10 character
> >> positions (that's a "Combining Diaeresis" over the 'n' character). So there
> >> are certain Unicode codepoints I want to skip counting---I want a "display
> >> length", not a "codepoint length". The example I gave was a bad example in
> >> this case.
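
To make the three counts above concrete, here is a minimal sketch for Lua 5.3
or later. The helper name display_len is hypothetical, and it assumes that
only U+0300..U+036F (the Combining Diacritical Marks block) is zero width,
which is far from a complete wcwidth:

```lua
-- Sketch only: display_len is a hypothetical helper, not part of any library.
-- Assumption: only U+0300..U+036F counts as zero width.
local function display_len(s)
  local n = 0
  for _, cp in utf8.codes(s) do
    -- Count every code point except the combining marks in that one block.
    if cp < 0x0300 or cp > 0x036F then
      n = n + 1
    end
  end
  return n
end

-- "Spinal Tap" with U+0308 (Combining Diaeresis) after the 'n'.
local x = "Spin\u{0308}al Tap"
print(#x, utf8.len(x), display_len(x))
```

The three printed values are the byte length, the code point count, and the
approximate number of terminal cells, in that order.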
> >
> > Maybe you just need a wcwidth routine in luautf8 module 😃
> >>
> >> Anyway, I do have code that works using lpeg.Carg():
> >>
> >> local cutf8 = R" ~" -- ASCII minus C0 control set
> >> + lpeg.P"\194" * lpeg.R"\160\191" -- UTF minus C1 control set [1]
> >> + lpeg.R"\195\223" * lpeg.R"\128\191"
> >> + lpeg.P"\224" * lpeg.R"\160\191" * lpeg.R"\128\191"
> >> + lpeg.R"\225\236" * lpeg.R"\128\191" * lpeg.R"\128\191"
> >> + lpeg.P"\237" * lpeg.R"\128\159" * lpeg.R"\128\191"
> >> + lpeg.R"\238\239" * lpeg.R"\128\191" * lpeg.R"\128\191"
> >> + lpeg.P"\240" * lpeg.R"\144\191" * lpeg.R"\128\191" * lpeg.R"\128\191"
> >> + lpeg.R"\241\243" * lpeg.R"\128\191" * lpeg.R"\128\191" * lpeg.R"\128\191"
> >> + lpeg.P"\244" * lpeg.R"\128\143" * lpeg.R"\128\191" * lpeg.R"\128\191"
> >>
> >> local nc = P"\204" * R"\128\191" -- combining chars
> >> + P"\205" * R"\128\175" -- combining chars
> >> + P"\225\170" * R"\176\190" -- combining chars
> >> + P"\225\183" * R"\128\191" -- combining chars
> >> + P"\226\131" * R"\144\176" -- combining chars
> >> + P"\239\184" * R"\160\175" -- combining chars
> >> + P"\u{00AD}" -- shy hyphen
> >> + P"\u{1806}" -- Mongolian TODO soft hyphen
> >> + P"\u{200B}" -- zero width space
> >> + P"\u{200C}" -- zero-width nonjoiner space
> >> + P"\u{200D}" -- zero-width joiner space
> >> local cnt = (nc + cutf8 * Carg(1) / function(s) s.cnt = s.cnt + 1 end)^0
> >> * Carg(1) / function(s) return s.cnt end
> >>
> >> It's not 100% perfect [2][3] but for what I'm doing, it works.
> >>
> >> -spc (I should mention that the string have had all control codes and
> >> sequences removed, so I do not need concern myself with that ...)
> >>
> >> [1] Definition of UTF-8 I'm using comes from RFC-3629
> >>
> >> [2] Doesn't handle RtL text; and there are other 0-width characters I'm
> >> missing.
> >>
> >> [3] I could replace the \u{hhh} construction with something that works
> >> for Lua 5.1.
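
On footnote [3], one possible replacement is a small encoder that builds the
UTF-8 bytes by hand, so no \u{hhh} escape is needed. This is only a sketch:
utf8_char is a hypothetical name, it assumes a valid code point in the range
0..0x10FFFF, and it does no checking for surrogates or out-of-range values.

```lua
-- Sketch only: utf8_char is a hypothetical helper for Lua 5.1 and later.
-- Assumption: cp is a valid code point in 0..0x10FFFF (no validation done).
local char, floor = string.char, math.floor

local function utf8_char(cp)
  if cp < 0x80 then
    -- 1 byte: 0xxxxxxx
    return char(cp)
  elseif cp < 0x800 then
    -- 2 bytes: 110xxxxx 10xxxxxx
    return char(0xC0 + floor(cp / 0x40),
                0x80 + cp % 0x40)
  elseif cp < 0x10000 then
    -- 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
    return char(0xE0 + floor(cp / 0x1000),
                0x80 + floor(cp / 0x40) % 0x40,
                0x80 + cp % 0x40)
  else
    -- 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return char(0xF0 + floor(cp / 0x40000),
                0x80 + floor(cp / 0x1000) % 0x40,
                0x80 + floor(cp / 0x40) % 0x40,
                0x80 + cp % 0x40)
  end
end

local zwsp = utf8_char(0x200B) -- zero width space as a byte string
```

With that in place, P"\u{200B}" in the pattern could be written as
P(utf8_char(0x200B)), and likewise for the other escapes.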
> >>
> >>
> > --
> > regards,
> > Xavier Wang.