- Subject: Re: Tricky little string parsing challenge
- From: Sean Conner <sean@...>
- Date: Tue, 19 Mar 2019 12:44:43 -0400
It was thus said that the Great Geoff Smith once stated:
> This one has got me stuck for the moment. Can anyone come up with an
> elegant solution for this without needing an external library, please?
>
> I have a long string of text that I need to split into sentences. Here is
> a sort-of-working attempt:
>
> local text = "This is one sentence. This is another but with a number in it like 0.47 need to ignore it. This is the third. Fourth sentence"
>
> local sentences = {}
> for i in string.gmatch(text, "[^%.]+") do
>   sentences[#sentences+1] = i
> end
>
> for i = 1, #sentences do
>   print(i, sentences[i])
> end
>
> Of course I had forgotten about not splitting on decimal points in
> numbers. How can I adapt this to ignore the full stop character if
> surrounded by numbers?
I had a similar issue back in 2014 [1], where I used LPeg to do the
parsing. I found that just breaking on a period wasn't enough, so I had
to special-case the following [2]:
MR.
Mrs.
MRS.
Dr.
DR.
P. S.
P.S.
T. E.
T.E.
Gen.
N. B.
N.B.
H.
M.
O.
Z.
The nice thing about LPeg was not only how easy it made adding exceptions
(like the list above), but also that I could transform the input into a
canonical format (such as converting N.B. to N. B.).
So yes, I do have a solution, but it does violate your constraint.
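That said, the narrower decimal-point case can probably be handled with the
standard string library alone. What follows is a minimal sketch, not the code
from [1]: it swaps LPeg for plain Lua patterns and a placeholder byte. The
function name split_sentences, the MASK byte, and the optional abbreviations
argument are all hypothetical, and the sketch assumes the byte "\1" never
appears in the input.

```lua
-- Hypothetical helper (not from word.lua): split on "." while leaving
-- decimal points and a caller-supplied list of abbreviations intact.
local MASK = "\1" -- placeholder byte, assumed never to occur in the input

local function split_sentences(text, abbreviations)
  -- Hide a period that sits between two digits, e.g. "0.47". The frontier
  -- %f[%d] checks the following digit without consuming it, so "1.2.3"
  -- has both of its periods hidden.
  local masked = text:gsub("(%d)%.%f[%d]", "%1" .. MASK)

  -- Hide the periods inside each listed abbreviation, e.g. "Mrs." or "N.B.".
  for _, abbr in ipairs(abbreviations or {}) do
    -- Escape punctuation so the abbreviation is matched literally, and
    -- require that it does not start in the middle of a word.
    local pattern = "%f[%a]" .. (abbr:gsub("%p", "%%%0"))
    local replacement = (abbr:gsub("%.", MASK))
    masked = masked:gsub(pattern, replacement)
  end

  -- Split on the periods that are left, restore the hidden ones, and trim.
  local sentences = {}
  for chunk in masked:gmatch("[^%.]+") do
    local sentence = chunk:gsub(MASK, "."):match("^%s*(.-)%s*$")
    if sentence ~= "" then
      sentences[#sentences + 1] = sentence
    end
  end
  return sentences
end

local text = "This is one sentence. This is another but with a number in it "
          .. "like 0.47 need to ignore it. This is the third. Fourth sentence"

for i, sentence in ipairs(split_sentences(text)) do
  print(i, sentence)
end
```

As in the original attempt, the terminating period of each sentence is
dropped. Passing a list such as { "Dr.", "N.B." } as the second argument keeps
those abbreviations from ending a sentence, but the list has to be maintained
by hand, which is where a grammar-based approach becomes more convenient.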
-spc (My use case was breaking the input into words, but it's similar
enough ... )
[1] https://github.com/spc476/NaNoGenMo-2014
Code I used:
https://github.com/spc476/NaNoGenMo-2014/blob/master/word.lua
[2] Some, like Mrs., are generic, while others, like T. E., are initials
specific to the document.