I am an NLP novice and would like to better understand how Named Entity Recognition (NER) is implemented in practice, for example in popular Python libraries such as spaCy.
I understand the basic concept, but I suspect I am missing some details. From the documentation, it is not clear to me, for example, how much preprocessing is applied to the text and annotation data, or which statistical model is used.
My questions are:
- Does the text have to go through chunking before the model is trained? I assume the model could not learn anything useful otherwise.
- Are the text and annotations typically normalized before training, so that a named entity is recognized whether it appears at the beginning or in the middle of a sentence?
- In spaCy specifically, how is NER implemented concretely? Is the model an HMM, a CRF, or something else?
Apologies if this is all trivial; I am having trouble finding easy-to-read documentation on NER implementations.