1
votes

"You could not possibly have come at a better time, my dear Watson," he said cordially. 'It is not worth your while to wait,' she went on."You can pass through the door; no one hinders." And then, seeing that I smiled and shook my head, she suddenly threw aside her constraint and made a step forward, with her hands wrung together.

Look at the highlighted area. How can I distinguish the case where a quotation mark (") is followed by a period (.) that ends the sentence from the case where a period (.) is followed by a quotation mark (")?

I tried the following regex for the tokenizer. It works well except for that one case.

(([^।\.?!]|[।\.?!](?=[\"\']))+\s*[।\.?!]\s*)

Edit: I am not planning to use any NLP toolkit to solve this problem.

Do you realize this is not a task for a single regex? A solution for this involves much more than what you "tried". - Wiktor Stribiżew
@Wiktor, this is a simple HackerRank challenge that only allows Python's re module. - djokester
So, it has no practical value. - Wiktor Stribiżew

2 Answers

1
votes

Use NLTK instead of regular expressions here:

from nltk import sent_tokenize
parts = sent_tokenize(your_string)
# ['"You could not possibly have come at a better time, my dear Watson," he said cordially.', "'It is not worth your while to wait,' she went on.", '"You can pass through the door; no one hinders."', 'And then, seeing that I smiled and shook my head, she suddenly threw aside her constraint and made a step forward, with her hands wrung together.']
0
votes

I found this function a while ago:

import re

def split_into_sentences(text):
    # Pattern fragments reused by the substitutions below.
    caps = r"([A-Z])"
    prefixes = r"(Mr|St|Mrs|Ms|Dr)[.]"
    suffixes = r"(Inc|Ltd|Jr|Sr|Co)"
    starters = r"(Mr|Mrs|Ms|Dr|He\s|She\s|It\s|They\s|Their\s|Our\s|We\s|But\s|However\s|That\s|This\s|Wherever)"
    acronyms = r"([A-Z][.][A-Z][.](?:[A-Z][.])?)"
    websites = r"[.](com|net|org|io|gov|mobi|info|edu)"

    # Python 3: str is already Unicode, so only bytes input needs decoding.
    if isinstance(text, bytes):
        text = text.decode('utf-8')

    text = " {0} ".format(text)
    text = text.replace("\n", " ")

    # Mark periods that do not end a sentence as <prd> so they survive the split.
    text = re.sub(prefixes, "\\1<prd>", text)
    text = re.sub(websites, "<prd>\\1", text)
    if "Ph.D" in text: text = text.replace("Ph.D.", "Ph<prd>D<prd>")
    text = re.sub(r"\s" + caps + r"[.] ", " \\1<prd> ", text)
    text = re.sub(acronyms + " " + starters, "\\1<stop> \\2", text)
    text = re.sub(caps + r"[.]" + caps + r"[.]" + caps + r"[.]", "\\1<prd>\\2<prd>\\3<prd>", text)
    text = re.sub(caps + r"[.]" + caps + r"[.]", "\\1<prd>\\2<prd>", text)
    text = re.sub(" " + suffixes + r"[.] " + starters, " \\1<stop> \\2", text)
    text = re.sub(" " + suffixes + r"[.]", " \\1<prd>", text)
    text = re.sub(" " + caps + r"[.]", " \\1<prd>", text)

    # Move a terminator out of a double quote so the quote stays with its sentence.
    if "\"" in text: text = text.replace(".\"", "\".")
    if "!" in text: text = text.replace("!\"", "\"!")
    if "?" in text: text = text.replace("?\"", "\"?")

    # Mark every remaining terminator as a sentence boundary, then restore <prd>.
    text = text.replace(".", ".<stop>")
    text = text.replace("?", "?<stop>")
    text = text.replace("!", "!<stop>")
    text = text.replace("<prd>", ".")

    # Split on the boundaries; the last piece is whatever follows the final terminator.
    sentences = text.split("<stop>")
    sentences = sentences[:-1]
    sentences = [s.strip() for s in sentences]
    return sentences
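
For the re-only constraint in the question, here is a minimal sketch of one possible heuristic, not a complete tokenizer. It assumes that a quote right after a terminator closes the current sentence only when whitespace or the end of the text follows it; otherwise the quote is taken to open the next sentence. The names SENTENCE_RE and split_sentences are made up for this illustration, and abbreviations such as "Mr." are not handled.

```python
import re

# A sentence is: any text (lazy), one or more terminators, then an optional
# quote that is kept only if whitespace or end of text follows it.
# Assumption: a quote followed directly by a non-space character opens the
# next sentence instead of closing this one.
SENTENCE_RE = re.compile(r'\s*(.+?[.?!।]+(?:["\'](?=\s|$))?)', re.DOTALL)

def split_sentences(text):
    """Split text into sentences using only the re module (heuristic)."""
    # findall returns the captured group, i.e. each sentence without
    # leading whitespace. Trailing text with no terminator is dropped.
    return SENTENCE_RE.findall(text)

if __name__ == "__main__":
    sample = ('"Stay," he said. \'Go,\' she went on.'
              '"You can pass; no one hinders." And then she left.')
    for sentence in split_sentences(sample):
        print(sentence)
```

On the shortened sample, the quote after "hinders." is followed by a space and stays with that sentence, while the quote after "on." is followed by "You" and starts the next one.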