- Subject: Re: Tricky little string parsing challenge
- From: Dirk Laurie <dirk.laurie@...>
- Date: Wed, 20 Mar 2019 08:10:32 +0200
On Tue, 19 Mar 2019 at 14:40, Geoff Smith <spammealot1@live.co.uk> wrote:
>
> This one has got me stuck for the moment, can anyone come up with an elegant solution for this without needing external library please.
>
> I have a long string of text that i need to split into sentences, here is a sort of working attempt
>
> local text = "This is one sentence. This is another but with a number in it like 0.47 need to ignore it. This is the third. Fourth sentence"
>
> local sentences = {}
> for i in string.gmatch(text, "[^%.]+" ) do
> sentences[#sentences+1] = i
> end
>
> for i = 1, #sentences do
> print(i, sentences[i])
> end
>
> Of course I had forgotten about not splitting on decimal points in numbers. How can I adapt this to ignore the full stop character if surrounded by numbers?
You can use the pattern (full stop, whitespace, capital letter) to
terminate a sentence. A decimal point as in 0.47 is followed by a digit,
not by whitespace, so it is left alone.
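function break_into_sentences(str)
  local prose = {}
  local start = 1   -- where the next sentence begins
  repeat
    -- lazily capture the text up to a full stop that is followed by
    -- whitespace and a capital; () captures the position of that capital
    local sentence,found = str:match("(.-)%.%s+()%u",start)
    prose[#prose+1] = sentence   -- no-op when sentence is nil
    if found then start=found end
  until not found
  prose[#prose+1] = str:sub(start)   -- whatever is left is the last sentence
  return prose
end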
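
To see what the pattern does, here are the first two steps of that loop
done by hand on the text from the original post. This is only a sketch
to illustrate the match; the variable names are mine.

```lua
local text = "This is one sentence. This is another but with a number in it like 0.47 need to ignore it. This is the third. Fourth sentence"

-- First step: search from position 1.
-- (.-) is lazy, so it stops at the first full stop that is followed
-- by whitespace and an uppercase letter.
local first, next_start = text:match("(.-)%.%s+()%u", 1)
print(first, next_start)   --> This is one sentence    23

-- Second step: search from the capital that began the next sentence.
-- The "." in 0.47 is followed by "4", not whitespace, so it is skipped.
local second = text:match("(.-)%.%s+()%u", next_start)
print(second)
--> This is another but with a number in it like 0.47 need to ignore it
```

Note that the full stop itself is not part of the captured sentence,
while the final piece returned by str:sub(start) keeps whatever
punctuation it ends with.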
Of course, there will still be some cases like "Dr. No" where the
author did not mean to end a sentence. That is no longer a string
parsing challenge, but a question of designing an unambiguous grammar.
For example:
* Bernard Shaw 100 years ago already dropped full stops in
abbreviations, and it is fairly common practice nowadays.
* I used to (but no longer do) put two spaces after each genuine full stop.
* You could put hard spaces after abbreviations.
* You could gsub a list of allowed exceptions, replacing the full stop by
a central dot or other unused Unicode character, and afterwards gsub
them back in.
Etc.
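
The last suggestion might look roughly like this. It is a sketch only:
the abbreviation list, the choice of U+00B7 as the stand-in character and
all the function names are my own assumptions, and it assumes the middle
dot does not otherwise occur in the text. The splitter is the same
pattern as above, repeated so that the snippet runs on its own.

```lua
-- Hypothetical list of abbreviations that must not end a sentence.
local abbreviations = { "Dr", "Mr", "Mrs", "Prof" }
local MIDDLE_DOT = "\194\183"   -- UTF-8 encoding of U+00B7

-- Replace the full stop after each listed abbreviation by a middle dot.
local function mask(str)
  for _, abbr in ipairs(abbreviations) do
    -- %f[%a] makes sure the abbreviation starts at a word boundary
    str = str:gsub("%f[%a]" .. abbr .. "%.", abbr .. MIDDLE_DOT)
  end
  return str
end

-- Put the full stops back.
local function unmask(str)
  return (str:gsub(MIDDLE_DOT, "."))
end

-- Split on (full stop, whitespace, capital letter).
local function split_sentences(str)
  local prose, start = {}, 1
  while true do
    local sentence, found = str:match("(.-)%.%s+()%u", start)
    if not found then break end
    prose[#prose + 1] = sentence
    start = found
  end
  prose[#prose + 1] = str:sub(start)
  return prose
end

local function break_with_exceptions(str)
  local prose = split_sentences(mask(str))
  for i, sentence in ipairs(prose) do
    prose[i] = unmask(sentence)
  end
  return prose
end

local result = break_with_exceptions("Dr. No is a film. It came out in 1962. The end")
for i, sentence in ipairs(result) do
  print(i, sentence)
end
--> 1    Dr. No is a film
--> 2    It came out in 1962
--> 3    The end
```

The frontier pattern %f is documented from Lua 5.2 onwards. An
abbreviation that contains magic characters would need escaping before
being spliced into the pattern.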