Re: Tricky little string parsing challenge

lua-l archive

Subject: Re: Tricky little string parsing challenge
From: Dirk Laurie <dirk.laurie@...>
Date: Wed, 20 Mar 2019 08:10:32 +0200

On Tue, 19 Mar 2019 at 14:40, Geoff Smith <spammealot1@live.co.uk> wrote:
>
> This one has got me stuck for the moment. Can anyone come up with an elegant solution for this that does not need an external library, please?
>
> I have a long string of text that I need to split into sentences. Here is a sort-of-working attempt:
>
> local text = "This is one sentence. This is another but with a number in it like 0.47 need to ignore it. This is the third. Fourth sentence"
>
> local sentences = {}
> for i in string.gmatch(text, "[^%.]+") do
>   sentences[#sentences+1] = i
> end
>
> for i = 1, #sentences do
>   print(i, sentences[i])
> end
>
> Of course, I had forgotten about not splitting on decimal points in numbers. How can I adapt this to ignore the full stop character if it is surrounded by digits?

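You can use the pattern (full stop, whitespace, capital letter) to
terminate a sentence. The decimal point in 0.47 is followed by a digit,
not by whitespace, so it no longer counts as a terminator.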

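function break_into_sentences(str)
  local prose = {}
  local start = 1
  repeat
    -- Lazily capture the text up to a full stop that is followed by
    -- whitespace and a capital letter; () captures that capital's position.
    local sentence,found = str:match("(.-)%.%s+()%u",start)
    prose[#prose+1] = sentence  -- no-op when nothing matched (sentence is nil)
    if found then start=found end
  until not found
  -- Whatever is left after the last terminator is the final sentence.
  prose[#prose+1] = str:sub(start)
  return prose
end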
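
Note that the captured sentences lose their terminating full stop. If you
would rather keep it, moving the full stop inside the first capture should
be enough. The following is only a sketch of that variation, run on the
sample text from the question; the name split_keeping_stops is made up
for this example.

```lua
-- Hypothetical variant: keep the terminating full stop with each sentence.
local function split_keeping_stops(str)
  local prose = {}
  local start = 1
  while true do
    -- (.-%.) captures up to and including the full stop;
    -- () records where the next sentence (its capital letter) begins.
    local sentence, found = str:match("(.-%.)%s+()%u", start)
    if not found then break end
    prose[#prose + 1] = sentence
    start = found
  end
  -- The remainder is the last sentence, with or without a full stop.
  prose[#prose + 1] = str:sub(start)
  return prose
end

local text = "This is one sentence. This is another but with a number in it like 0.47 need to ignore it. This is the third. Fourth sentence"
local sentences = split_keeping_stops(text)
for i, s in ipairs(sentences) do
  print(i, s)
end
```

This yields four entries, and the second one still contains 0.47 intact.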

Of course, there will still be some cases like "Dr. No", where the
author did not mean to end a sentence. That is no longer a string
parsing challenge, but a question of designing an unambiguous grammar.
For example:

* Bernard Shaw was already dropping full stops from abbreviations 100
years ago, and it is fairly common practice nowadays.
* I used to (but no longer do) put two spaces after each genuine full stop.
* You could put hard (non-breaking) spaces after abbreviations.
* You could gsub a list of allowed exceptions, replacing the full stop
with a central dot or other unused Unicode character, and afterwards
gsub them back in.

Etc.
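
The last suggestion (gsub the exceptions out, split, gsub them back in)
could look roughly like the sketch below. It makes several assumptions:
the abbreviation list is invented for illustration, the input is taken to
be UTF-8 that does not already contain U+00B7 (middle dot), and the
%f frontier pattern needs Lua 5.2 or later. The function names are
hypothetical too.

```lua
-- Hypothetical exception list; extend it to suit your own text.
local abbreviations = { "Dr", "Mr", "Mrs", "Prof" }
-- UTF-8 bytes of U+00B7 MIDDLE DOT; assumed not to occur in the input.
local PLACEHOLDER = "\194\183"

-- Replace the full stop after each listed abbreviation with the placeholder.
local function protect(str)
  for _, abbr in ipairs(abbreviations) do
    -- %f[%a] is a frontier: the abbreviation must start a word.
    str = str:gsub("%f[%a]" .. abbr .. "%.", abbr .. PLACEHOLDER)
  end
  return str
end

-- Put the full stops back (parentheses drop gsub's second result).
local function restore(str)
  return (str:gsub(PLACEHOLDER, "."))
end

local function split_with_exceptions(str)
  local prose = {}
  local protected = protect(str)
  local start = 1
  while true do
    -- Same terminator as before: full stop, whitespace, capital letter.
    local sentence, found = protected:match("(.-)%.%s+()%u", start)
    if not found then break end
    prose[#prose + 1] = restore(sentence)
    start = found
  end
  prose[#prose + 1] = restore(protected:sub(start))
  return prose
end

local result = split_with_exceptions("I watched Dr. No last night. It was good. The end")
for i, s in ipairs(result) do
  print(i, s)
end
```

Here "Dr. No" stays inside the first sentence. The price is that a
sentence which really does end in a listed abbreviation will not be split
there, so the ambiguity has only been moved into the exception list.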
