Preventing spaCy splitting paragraph numbers into sentences

Question

I'm using spaCy to do sentence segmentation on texts that use paragraph numbering, for example:

text = '3. English law takes a dim view of stealing stuff from the shops. Some may argue that this is a pity.'

I'm trying to force spaCy's sentence segmenter not to split the 3. into a sentence of its own.

At the moment, the following code returns three separate sentences:

import spacy

nlp = spacy.load("en_core_web_sm")

text = """3. English law takes a dim view of stealing stuff from the shops. Some may argue that this is a pity."""
doc = nlp(text)
for sent in doc.sents:
    print("****", sent.text)

This returns:

**** 3.
**** English law takes a dim view of stealing stuff from the shops.
**** Some may argue that this is a pity.

I've been trying to stop this from happening by passing a custom rule into the pipeline before the parser:

if token.text == r'\d\.':
    doc[token.i+1].is_sent_start = False

This doesn't seem to have any effect. Has anyone come across this problem before?
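One likely reason the rule above never fires: `==` compares the token text with the literal characters `\d\.`, not with a regular expression. Below is a minimal sketch of the comparison done with `re.fullmatch` instead. The helper name `is_paragraph_marker` is made up for illustration and is not part of spaCy.

```python
import re

# Matches a bare paragraph marker such as "3." or "12."
PARAGRAPH_MARKER = re.compile(r"\d+\.")


def is_paragraph_marker(text: str) -> bool:
    """Return True if text consists of digits followed by a single period."""
    return PARAGRAPH_MARKER.fullmatch(text) is not None


# Plain equality tests against the literal string backslash-d-backslash-dot.
print("3." == r"\d\.")             # False
print(is_paragraph_marker("3."))   # True
print(is_paragraph_marker("3"))    # False
```

The tokenizer may also split `3.` into two tokens, `3` and `.`, in which case a per-token check has to look at the pair rather than a single token. Printing `[t.text for t in doc]` shows which case applies.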

This does not answer the question, which is about spaCy, but I can suggest my own sentence segmentation and tokenization tool, segtok, and its latest incarnation ("segtok version 2"), syntok. Neither splits sentences at enumerations, and syntok even handles cases like "This is a sentence.And here we forgot a space." while its token stream retains the original input. It is a performant, production-ready, high-quality sentence segmenter for at least English, Spanish, and German. You might want to take a look. - fnl

Srce Cde · Accepted Answer · 2018-11-22T14:20:05

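Something like this? Merging the leading tokens into a single token keeps a sentence boundary from being placed right after the number: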

text = ["""3. English law takes a dim view of stealing stuff from the shops. Some may argue that this is a pity. Are you upto something?""", 
        """4. It's hilarious and I think this can be more of a political moment. Don't you think so? Will Robots replace humans?"""]
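for i in text:
    doc = nlp(i)  # nlp is the pipeline loaded in the question
    # Merge the first five tokens (the number, the period and the words
    # that follow) into one token, so no sentence can end inside them.
    span = doc[0:5]
    span.merge()  # spaCy v2 API
    for sent in doc.sents:
        print("****", sent.text)
    print("\n")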

Output:

**** 3. English law takes a dim view of stealing stuff from the shops.
**** Some may argue that this is a pity.
**** Are you upto something?


**** 4. It's hilarious and I think this can be more of a political moment.
**** Don't you think so?
**** Will Robots replace humans?

Reference: span.merge()
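
`span.merge()` was removed in spaCy v3. As an alternative that is closer to what the question attempted, here is a rough sketch of a custom component that presets the sentence boundary before segmentation runs. It assumes spaCy v3, assumes the paragraph number sits at the very start of the text, and uses a blank English pipeline with the rule-based `sentencizer` so that no trained model is needed. The component name `keep_paragraph_numbers` is made up for illustration.

```python
import re

import spacy
from spacy.language import Language

DIGITS = re.compile(r"\d+")
DIGITS_DOT = re.compile(r"\d+\.")


@Language.component("keep_paragraph_numbers")  # hypothetical component name
def keep_paragraph_numbers(doc):
    """Keep a leading paragraph number in the same sentence as the text after it."""
    if len(doc) > 1 and DIGITS_DOT.fullmatch(doc[0].text):
        # Case 1: the tokenizer kept "3." as a single token.
        doc[1].is_sent_start = False
    elif len(doc) > 2 and DIGITS.fullmatch(doc[0].text) and doc[1].text == ".":
        # Case 2: the tokenizer split "3." into "3" and ".".
        doc[1].is_sent_start = False
        doc[2].is_sent_start = False
    return doc


nlp = spacy.blank("en")                 # tokenizer only, no trained model
nlp.add_pipe("keep_paragraph_numbers")  # runs before any sentence segmentation
nlp.add_pipe("sentencizer")             # rule-based splitter added after the custom rule

text = ("3. English law takes a dim view of stealing stuff from the shops. "
        "Some may argue that this is a pity.")
doc = nlp(text)
sentences = [sent.text for sent in doc.sents]
for sentence in sentences:
    print("****", sentence)
```

With a trained pipeline such as `en_core_web_sm`, the same component would instead be registered with `nlp.add_pipe("keep_paragraph_numbers", before="parser")`, so the parser sees the preset boundaries.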
