
I am using the sentencizer from spaCy to split a document into sentences. The default punctuation characters for the sentencizer are ('.', '!', '?'). But if I give it a sentence like:

"A fawn was racing in the forest!He was ahead of the rabbit?He was ahead of the elephant."

it's not split into 3 sentences.

I tried this:

import spacy

sen = ("A fawn was racing in the forest!He was ahead of the rabbit?"
       "He was ahead of the elephant.")
nlp = spacy.load('en')
nlp.add_pipe(nlp.create_pipe('sentencizer'), first=True)
doc = nlp(sen)
sentences = [sent.text.strip() for sent in doc.sents]

But it's not splitting on the ! and ?.
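My guess as to why (not verified against spaCy internals): the sentencizer only marks a boundary where a token is exactly '!', '.', or '?', and since there is no space after the punctuation, the tokenizer keeps "forest!He" as a single token. A plain-Python illustration of the gluing:

```python
text = "A fawn was racing in the forest!He was ahead of the rabbit?He was ahead of the elephant."

# with no space after "!" or "?", the punctuation stays stuck to the
# next word, so there is never a standalone "!" or "?" token to break on
print([w for w in text.split() if "!" in w or "?" in w])
# ['forest!He', 'rabbit?He']
```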

The expected output for the input:

"A fawn was racing in the forest!He was ahead of the rabbit?He was ahead of the elephant."

"A fawn was racing in the forest!"

"He was ahead of the rabbit?"

"He was ahead of the elephant."

Can anyone help with this?

Thanks in advance.


1 Answer


I had a similar problem with sentences missing the space after the period when preceded by a quote, like:

He told me "Go somewhere else".But I didn't want to go.

The solution is in this document: Customizing spaCy’s Tokenizer class

I got further inspiration from Adding Custom Tokenization Rules to spaCy.

Here is a rule that works, basically copied from that post:

import re
import spacy
from spacy.tokenizer import Tokenizer

def custom_tokenizer(nlp):
    prefix_re = spacy.util.compile_prefix_regex(nlp.Defaults.prefixes)

    # add ?, !, etc. as infixes so "forest!He" is split into separate tokens
    # (raw strings avoid invalid-escape warnings in the regex patterns)
    custom_infixes = [r'\.\.\.+', r'(?<=[0-9])-(?=[0-9])', r'[?!&:,()]']
    infix_re = spacy.util.compile_infix_regex(list(nlp.Defaults.infixes) + custom_infixes)

    suffix_re = spacy.util.compile_suffix_regex(nlp.Defaults.suffixes)

    return Tokenizer(nlp.vocab, nlp.Defaults.tokenizer_exceptions,
                     prefix_search=prefix_re.search,
                     suffix_search=suffix_re.search,
                     infix_finditer=infix_re.finditer,
                     token_match=None)


nlp = spacy.load('en')
nlp.tokenizer = custom_tokenizer(nlp)
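To see what the extra `[?!&:,()]` infix class buys you, here is a quick check with plain `re` outside spaCy (just an illustration of the character class, not how spaCy applies it internally):

```python
import re

# same character class as in custom_infixes above
infix = re.compile(r'([?!&:,()])')

# splitting on it (the capture group keeps the delimiter) pulls
# "forest!He" apart, which is what later lets the sentencizer
# see "!" as a token of its own and end the sentence there
print(infix.split("forest!He"))
# ['forest', '!', 'He']
```

With the tokenizer producing separate "!" and "?" tokens, adding the sentencizer to the pipeline as in the question should then yield the three expected sentences.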