I'm using spaCy to do sentence segmentation on texts that use paragraph numbering, for example:
text = '3. English law takes a dim view of stealing stuff from the shops. Some may argue that this is a pity.'
I'm trying to force spaCy's sentence segmenter not to split the 3. into a sentence of its own.
At the moment, the following code returns three separate sentences:
import spacy

nlp = spacy.load("en_core_web_sm")
text = """3. English law takes a dim view of stealing stuff from the shops. Some may argue that this is a pity."""
doc = nlp(text)
for sent in doc.sents:
    print("****", sent.text)
This returns:
**** 3.
**** English law takes a dim view of stealing stuff from the shops.
**** Some may argue that this is a pity.
I've been trying to stop this from happening by passing a custom rule into the pipeline before the parser:
if token.text == r'\d\.':
    doc[token.i+1].is_sent_start = False
This doesn't seem to have any effect. Has anyone come across this problem before?
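One likely reason the rule never fires is that `token.text == r'\d\.'` compares the token text with the literal four-character string `\d\.` instead of applying it as a regular expression. The tokenizer may also split `3.` into two tokens, `3` and `.`, in which case no single token would match even with a real regex. Below is a minimal sketch of a rule that covers both cases. The function name `attach_paragraph_numbers` and the check that the number sits at the start of the text or of a line are my own assumptions for illustration, not part of spaCy.

```python
import re


def attach_paragraph_numbers(doc):
    """Mark the tokens after a leading paragraph number as not starting a sentence."""
    n = len(doc)
    for i in range(n):
        # Assumption: a paragraph number is the first token or follows a line break.
        at_line_start = i == 0 or "\n" in doc[i - 1].text
        if not at_line_start:
            continue
        text = doc[i].text
        if re.fullmatch(r"\d+\.", text):
            # The tokenizer kept "3." as a single token.
            following = [i + 1]
        elif re.fullmatch(r"\d+", text) and i + 1 < n and doc[i + 1].text == ".":
            # The tokenizer split "3." into "3" and ".".
            following = [i + 1, i + 2]
        else:
            continue
        for j in following:
            if j < n:
                doc[j].is_sent_start = False
    return doc
```

In spaCy v3 a function like this would be registered first, for example with `Language.component("attach_paragraph_numbers", func=attach_paragraph_numbers)`, and then added by name with `nlp.add_pipe("attach_paragraph_numbers", before="parser")`. In spaCy v2 the function itself is passed to `nlp.add_pipe(..., before="parser")`. Restricting the rule to the start of a line is meant to avoid gluing together sentences that merely end in a number, such as "It happened in 1999."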