How to prepare data for spaCy's custom named entity recognition?

Question

I'm trying to prepare a training dataset for custom named entity recognition using spaCy. My data has a variable 'Text', which contains sentences, and a variable 'Names', which contains the names of people that appear in those sentences. After going through some examples and spaCy's documentation, I realised that one has to pass the start and end indices of each entity when preparing the dataset. Is there any way to pass the entity directly as a string instead?

Reference: "https://medium.com/@manivannan_data/how-to-train-ner-with-custom-training-data-using-spacy-188e0e508c6"

sinanggul · Accepted Answer · 2019-08-08T15:25:23

No. spaCy needs exact start and end character indices for each entity, because the string by itself cannot always be uniquely located and resolved in the source text. Examples:

Apple is usually an ORG, but can be a PERSON.
Ann is a PERSON, but not in "Annotation tools are best for this purpose."
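
To make the second example concrete, here is a minimal sketch (not part of the original answer) showing that a plain substring search finds "Ann" inside "Annotation", while a regular expression with word boundaries does not. The second sentence is made up for illustration.

```python
import re

text = "Annotation tools are best for this purpose."

# A plain substring search happily matches "Ann" inside "Annotation".
substring_index = text.find("Ann")
print(substring_index)

# A word-boundary pattern only matches "Ann" as a standalone word.
pattern = re.compile(r"\bAnn\b")
spans_in_annotation = [m.span() for m in pattern.finditer(text)]
print(spans_in_annotation)

# Hypothetical sentence where "Ann" really is a separate token.
spans_in_name = [m.span() for m in pattern.finditer("Ann likes annotation tools.")]
print(spans_in_name)
```

Word boundaries only remove one kind of false match. They cannot tell whether "Apple" is a company or a person in a given sentence, which is why the indices still have to be checked.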

In Python, you can use the re module to find the indices:

>>> import re
>>> [m.span() for m in re.finditer('Amazon', 'The Amazon is a river in South America.  Amazon Inc is a company.')]
[(4, 10), (41, 47)]

You will have to go through and verify the indices before creating your spaCy training set.
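
As a rough sketch of how this could be applied to the 'Text' and 'Names' data described in the question, the helper below turns one sentence and its list of names into the (text, {"entities": [(start, end, label)]}) tuples used in spaCy 2.x training examples such as the referenced tutorial. The function name build_training_example, the sample rows, and the use of \b word boundaries are assumptions made for illustration, and the PERSON label is inferred from the question.

```python
import re

def build_training_example(text, names, label="PERSON"):
    """Return (text, {"entities": [(start, end, label), ...]}) for one sentence."""
    entities = []
    for name in names:
        # re.escape guards against regex metacharacters in names (e.g. "J. Smith");
        # \b keeps "Ann" from matching inside "Annotation".
        pattern = r"\b" + re.escape(name) + r"\b"
        for match in re.finditer(pattern, text):
            entities.append((match.start(), match.end(), label))
    # Sort by start offset so the spans are easy to check by eye.
    entities.sort()
    return (text, {"entities": entities})

# Hypothetical rows standing in for the 'Text' and 'Names' variables.
rows = [
    ("Ann met Bob in Paris.", ["Ann", "Bob"]),
    ("Annotation tools helped Ann.", ["Ann"]),
]
train_data = [build_training_example(text, names) for text, names in rows]
for example in train_data:
    print(example)
```

This only automates the lookup. Every occurrence of a name is labelled, so spans that are not really people still have to be removed by hand, and names that overlap (for example "Ann" and "Ann Lee" in the same list) would produce overlapping spans that need to be resolved before training.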
