How can I tokenize this text into sentences with Regex

Question

"You could not possibly have come at a better time, my dear Watson," he said cordially. 'It is not worth your while to wait,' she went on."You can pass through the door; no one hinders." And then, seeing that I smiled and shook my head, she suddenly threw aside her constraint and made a step forward, with her hands wrung together.

Look at the part that reads `she went on."You can pass`. How can I distinguish a period (.) that ends a sentence and is immediately followed by an opening '"' from a period (.) that is followed by a closing '"', as in `no one hinders." And then`?

I have tried this pattern for the tokenizer. It works well except for that one part.

(([^।\.?!]|[।\.?!](?=[\"\']))+\s*[।\.?!]\s*)

Edit: I am not planning to use any NLP toolkit to solve this problem.

Do you realize it is not a task for a single regex? A solution for this includes much more than what you "tried". — Wiktor Stribiżew
Wiktor, this is a simple HackerRank challenge that only allows Python's re module. — djokester

Jan · Accepted Answer · 2017-05-26T19:56:56

Use NLTK instead of regular expressions here:

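from nltk import sent_tokenize

# your_string holds the passage from the question. sent_tokenize needs NLTK's
# Punkt tokenizer data (installed with nltk.download; the resource name
# depends on the NLTK version).
parts = sent_tokenize(your_string)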
# ['"You could not possibly have come at a better time, my dear Watson," he said cordially.', "'It is not worth your while to wait,' she went on.", '"You can pass through the door; no one hinders."', 'And then, seeing that I smiled and shook my head, she suddenly threw aside her constraint and made a step forward, with her hands wrung together.']
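If only the re module is allowed, as the question's edit says, a rough regex-only sketch is possible for this particular passage. It rests on assumptions that are not in the question: a quote right after a terminator counts as a closing quote only when whitespace or the end of the text follows it, a quote followed directly by a capital letter counts as an opening quote, and every sentence starts with an ASCII capital. The names `BOUNDARY`, `split_sentences` and `SAMPLE` are made up for illustration.

```python
import re

# The passage from the question, including the missing space in: on."You
SAMPLE = (
    '"You could not possibly have come at a better time, my dear Watson," he said cordially. '
    "'It is not worth your while to wait,' she went on."
    '"You can pass through the door; no one hinders." '
    "And then, seeing that I smiled and shook my head, she suddenly threw aside "
    "her constraint and made a step forward, with her hands wrung together."
)

# Assumption: closing quotes are followed by whitespace or end of text,
# opening quotes are followed directly by a capital letter.
BOUNDARY = re.compile(r"""
    (                       # group 1: the end of a sentence
        [.?!।]              #   a terminator (same set as in the question)
        (?:["'](?=\s|$))?   #   plus a closing quote, if one follows
    )
    (?:
        \s+(?=["']?[A-Z])   # whitespace, then optional opening quote and a capital
      | (?=["'][A-Z])       # or no whitespace, directly an opening quote
    )
""", re.VERBOSE)


def split_sentences(text):
    """Cut text at every BOUNDARY match and return the stripped pieces."""
    sentences = []
    start = 0
    for match in BOUNDARY.finditer(text):
        # Keep the terminator and any closing quote with the sentence.
        sentences.append(text[start:match.end(1)].strip())
        # Skip the whitespace between sentences, if there was any.
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


for sentence in split_sentences(SAMPLE):
    print(sentence)
```

On the sample passage this gives the same four sentences as the sent_tokenize output above. It is not a general tokenizer: abbreviations such as "Mr. Watson" would be split after the period, and text in scripts without capital letters would need a different lookahead.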
