
I've been trying out spaCy for a small side-project, and had a few questions & concerns.

I noticed that spaCy's named-entity recognition results (with its largest en_vectors_web_lg model) don't seem to be as accurate as those of the Google Cloud Natural Language API [1]. Google's API extracts more entities, more accurately, most likely because its model is even larger. So, is there a way to improve spaCy's NER results, either with a different model or through some other technique?

Secondly, Google's API also returns Wikipedia article links for relevant entities. Is this possible with spaCy too, or using some other technique on top of spaCy's NER results?

Thirdly, I noticed that spaCy has a similarity() method [2] that uses GloVe word vectors. Being new to it, I'm not sure of the best way to frequently compare each document in a set (say 5,000-10,000 text documents of under 500 characters each) against the others, so as to generate buckets of similar documents.

Hoping someone has suggestions or tips.

Many thanks!


[1] https://cloud.google.com/natural-language/

[2] https://spacy.io/usage/vectors-similarity


1 Answer


...So, is there a way to improve spaCy's NER?

It is possible to train spaCy's model to improve its NER. You can use a GoldParse object to train it: https://spacy.io/usage/training

Secondly, Google's API also returns Wikipedia article links for relevant entities. Is this possible with spaCy too, or using some other technique on top of spaCy's NER results?

I have not seen anyone try this with spaCy.

Thirdly, I noticed that spaCy has a similarity() method [2] that uses GloVe word vectors...

I think this is a clustering problem and will not be solved just by using spaCy's similarity. For clustering, I would highly recommend going through the following link: http://brandonrose.org/clustering
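The tutorial linked above builds on TF-IDF vectors and k-means; a minimal sketch of that approach with scikit-learn (the four short documents here are made-up stand-ins for your corpus):

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up stand-ins for the 5,000-10,000 short documents.
docs = [
    "the cat sat on the mat",
    "a cat and a kitten played",
    "stock markets fell sharply today",
    "markets and stocks dropped again",
]

# Turn each document into a TF-IDF vector.
vectors = TfidfVectorizer().fit_transform(docs)

# Bucket the documents into 2 clusters.
km = KMeans(n_clusters=2, random_state=42, n_init=10)
labels = km.fit_predict(vectors)

# Group documents by their assigned cluster label.
buckets = {}
for doc, label in zip(docs, labels):
    buckets.setdefault(label, []).append(doc)
```

If you want to stay within spaCy, you could instead feed each document's vector (nlp(text).vector, with a model that ships word vectors) into the same KMeans step; either way the clustering, not the pairwise similarity() calls, is what produces the buckets.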