I've been trying out spaCy for a small side-project, and had a few questions & concerns.
I noticed that spaCy's named-entity recognition results (with its largest model that ships an NER pipeline, en_core_web_lg — en_vectors_web_lg only provides word vectors) don't seem to be as accurate as those of the Google Cloud Natural Language API [1]. Google's API extracts more entities, more accurately, most likely because their model is much larger. So, is there a way to improve spaCy's NER results, either with a different model or through some other technique?
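For what it's worth, one workaround I've been experimenting with is augmenting the statistical NER output with a dictionary (gazetteer) pass for entities the model keeps missing. This is just my own sketch — the gazetteer contents, function name, and labels below are placeholders I made up, not anything from spaCy:

```python
import re

# Hypothetical gazetteer of entities the statistical model misses
GAZETTEER = {
    "Google Cloud Natural Language": "PRODUCT",
    "spaCy": "ORG",
}

def augment_entities(text, ner_entities):
    """Merge model-predicted entities with dictionary matches.

    ner_entities: list of (start_char, end_char, label) tuples, e.g. built as
    [(ent.start_char, ent.end_char, ent.label_) for ent in doc.ents].
    Dictionary matches are only added where they don't overlap an
    existing predicted entity.
    """
    spans = list(ner_entities)
    taken = [(s, e) for s, e, _ in spans]
    for name, label in GAZETTEER.items():
        for m in re.finditer(re.escape(name), text):
            s, e = m.span()
            if all(e <= ts or s >= te for ts, te in taken):
                spans.append((s, e, label))
                taken.append((s, e))
    return sorted(spans)
```

No idea if this is idiomatic, but it at least lets me patch over recall gaps for a known domain vocabulary without retraining anything.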
Secondly, Google's API also returns Wikipedia article links for relevant entities. Is this possible with spaCy too, or via some technique layered on top of spaCy's NER results?
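The closest I've gotten on my own is querying Wikipedia's opensearch endpoint with each entity string and taking the top hit. Rough sketch below — the helper names are mine, and obviously this does no real disambiguation, so it can link the wrong article for ambiguous names:

```python
import json
import urllib.parse
import urllib.request

WIKI_API = "https://en.wikipedia.org/w/api.php"

def wikipedia_search_url(entity_text):
    """Build an opensearch query URL for an entity string."""
    params = {"action": "opensearch", "limit": "1",
              "format": "json", "search": entity_text}
    return WIKI_API + "?" + urllib.parse.urlencode(params)

def wikipedia_link(entity_text):
    """Return the top Wikipedia article URL for an entity, or None.

    Makes a network call; opensearch responses look like
    [query, titles, descriptions, urls].
    """
    with urllib.request.urlopen(wikipedia_search_url(entity_text)) as resp:
        data = json.load(resp)
    return data[3][0] if data[3] else None
```

I'd run this over `doc.ents` after NER, but I suspect there's a more principled entity-linking approach I'm missing.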
Thirdly, I noticed that spaCy has a similarity() method [2] that uses GloVe word vectors. But being new to the library, I'm not sure of the best way to repeatedly compare every document in a set against the others (say 5,000-10,000 text documents of under 500 characters each) to generate buckets of similar documents.
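To make that question concrete, here's roughly what I had in mind, assuming I'd stack up `doc.vector` for each document first. The greedy thresholded clustering is just my naive sketch (threshold picked arbitrarily), not something from spaCy's docs:

```python
import numpy as np

def bucket_documents(vectors, threshold=0.9):
    """Greedy single-pass clustering by cosine similarity.

    vectors: (n_docs, dim) array, e.g. np.stack([doc.vector for doc in docs]).
    Each document joins the first bucket whose running centroid it is
    at least `threshold`-similar to, otherwise it starts a new bucket.
    Returns a list of buckets, each a list of document indices.
    """
    # Normalize rows so cosine similarity reduces to a dot product
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.clip(norms, 1e-9, None)

    buckets = []    # list of lists of document indices
    centroids = []  # unit-length running centroid per bucket
    for i, v in enumerate(unit):
        sims = [float(v @ c) for c in centroids]
        if sims and max(sims) >= threshold:
            j = int(np.argmax(sims))
            buckets[j].append(i)
            c = unit[buckets[j]].mean(axis=0)
            centroids[j] = c / np.linalg.norm(c)
        else:
            buckets.append([i])
            centroids.append(v)
    return buckets
```

At 5,000-10,000 short documents this seems tractable, but I don't know whether comparing against bucket centroids (rather than all pairs) is the "right" way, or whether there's a standard tool for this.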
Hoping someone has suggestions or tips.
Many thanks!