- Subject: Re: Tricky little string parsing challenge
- From: Sean Conner <sean@...>
- Date: Thu, 21 Mar 2019 21:43:46 -0400
It was thus said that the Great Steve Litt once stated:
> On Thu, 21 Mar 2019 18:08:15 -0400
> Sean Conner <sean@conman.org> wrote:
>
> > It was thus said that the Great Steve Litt once stated:
> > > On Tue, 19 Mar 2019 12:07:19 +0000
> > > Geoff Smith <spammealot1@live.co.uk> wrote:
> > >
> > >
> > > > Of course I had forgotten about not splitting on decimal points in
> > > > numbers. How can I adapt this to ignore the full stop character
> > > > if surrounded by numbers?
> > > >
> > > > Thanks for any solutions.
> > >
> > > The problem is in the specification. It's not easy to describe
> > > what's a sentence ender and what's a decimal point. I'd split on a
> > > dot followed immediately by whitespace: Space, Tab, Newline or
> > > Formfeed.
> >
> > Mr. Litt would break on a dot followed by whitespace. Mr. Conner
> > would disagree, as he thinks e. e. cummings would also disagree.
> > What constitutes a sentence? Is this a sentence?
> >
> > -spc (He who took the No. 9 train.)
>
> I have no idea whether the preceding is a sentence. Maybe the Chicago
> Manual of Style would help?
>
> You bring up an interesting point. No matter how wonderful our
> sentence detection algorithm, there will always be exceptions. Maybe
> the key is to go as far as possible with the general algorithm, and
> then use a blacklist and whitelist for each of specific text in the
> document and specific phrases.
>
> Also, as Dirk pointed out, it might be better to split on a dot, one
> or two spaces, and a capital letter.
Mr.
Litt would break on a dot followed by whitespace.
Mr.
Conner would disagree, as he thinks e.
e.
cummings would also disagree.
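For anyone who wants to reproduce the pieces above, here is a minimal sketch of the "dot followed by whitespace" rule using plain Lua patterns. The function name split_naive is made up for illustration, and the sketch simply drops any trailing text that has no final dot.

```lua
-- Naive rule: a piece ends at a dot that is followed by whitespace.
-- split_naive is a hypothetical name, not something from this thread.
local function split_naive(text)
  local pieces = {}
  -- Append a space so that a dot at the very end of the text also
  -- counts as "dot followed by whitespace".
  -- "(.-%.)" lazily captures up to and including a dot;
  -- "%s+" requires whitespace right after that dot, so the dot in
  -- "3.14" is skipped and the match keeps extending.
  for piece in (text .. " "):gmatch("(.-%.)%s+") do
    pieces[#pieces + 1] = piece
  end
  return pieces
end

local text = "Mr. Litt would break on a dot followed by whitespace. "
          .. "Mr. Conner would disagree, as he thinks e. e. cummings "
          .. "would also disagree."

for i, piece in ipairs(split_naive(text)) do
  print(i, piece)
end
```

Run on those two sentences, it yields six pieces rather than two, which is the objection.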
> Unless, of course, you're
> beginning the sentence with "systemd",
e. e. cummings is also an exception here.
> whose producers insist on
> spelling it with all small characters. Also, I forgot that sentences
> can end with an exclamation point or a question mark.
You forgot the interrobang‽
> Is regex the best way, or might this better be done with callback
> routines?
LPEG.
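A rough sketch of what that could look like, assuming the lpeg module is installed and the source is UTF-8. The rule encoded here is one reading of the suggestions in this thread (an ender, then whitespace, then a capital letter, or an ender at end of text), and the abbreviation list is a made-up placeholder, not a complete one.

```lua
local lpeg = require "lpeg"
local P, S, R, C, Ct = lpeg.P, lpeg.S, lpeg.R, lpeg.C, lpeg.Ct

local ws     = S" \t\n\f"                 -- space, tab, newline, formfeed
local upper  = R"AZ"                      -- ASCII capitals only
-- Hypothetical exception list; extend as needed. There is no word
-- boundary check, so these also match inside longer words.
local abbrev = P"Mr." + P"Mrs." + P"No."
-- The interrobang is multi-byte in UTF-8, so match it as a string.
local ender  = S".!?" + P"‽"
-- An ender stops a sentence only at end of text, or when followed
-- by whitespace and then a capital letter (lookahead, not consumed).
local stop   = ender * (-P(1) + #(ws^1 * upper))
-- Abbreviations are consumed whole, so their dots never count as a stop.
local char   = abbrev + (P(1) - stop)
local sentence  = C(char^1 * stop^-1) * ws^0
local sentences = Ct(ws^0 * sentence^0)

local text = "Mr. Conner took the No. 9 train. It cost 3.50 dollars! "
          .. "Is this a sentence? Yes‽"

for i, s in ipairs(sentences:match(text)) do
  print(i, s)
end
```

This keeps "Mr.", "No. 9", "3.50" and "e. e. cummings" intact, but it still will not split before a sentence that starts with a lowercase word such as "systemd".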
-spc (Definitely LPEG)