I've trained a custom NER model in spaCy with a custom tokenizer. I'd like to save the NER model without the tokenizer. I tried the following code, which I found in the spaCy support forum:

import spacy

nlp = spacy.load("en")
nlp.tokenizer = some_custom_tokenizer
# Train the NER model...
nlp.tokenizer = None
nlp.to_disk('/tmp/model', disable=['tokenizer'])

When I try to load it, the pipeline is empty and, surprisingly, it has the default spaCy tokenizer.

nlp = spacy.blank('en').from_disk('/tmp/model', disable=['tokenizer'])

Any idea how I can load the model without the tokenizer but still get the full pipeline? Thanks.

Off topic: @Gino, I just noticed your question and I'm looking for people who are experienced in developing NLP applications. I provide a framework that aims to make the development of custom NLP models easier. It is called NLPf and provides, for example, an annotation tool that makes the annotation process much easier (maybe you can benefit from it as well while you increase your training data). Do you have some time to experiment with the framework and answer a questionnaire? - schrieveslaach
@Schrieveslaach Sure! - Gino
Could you reach out to me via e-mail to discuss how I could provide any guidance? You can find my e-mail, for example, in this commit. Just hover over my username. - schrieveslaach

1 Answer

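You can use nlp = spacy.load('/tmp/model') to load your model after you have saved it to disk. According to the spaCy documentation (https://spacy.io/usage/training#section-saving-loading), what you did, calling from_disk on a blank Language object, apparently only loads the binary data. It does not rebuild the pipeline from the saved model's metadata, which would explain why your pipeline comes back empty.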
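
As a minimal sketch of that approach, assuming spaCy is installed: the model is saved with the default tokenizer in place, loaded with spacy.load, and the custom tokenizer is re-attached in code afterwards. A blank English pipeline stands in for your trained model here, and WhitespaceTokenizer is a hypothetical placeholder for your own tokenizer, not something from your code.

```python
import tempfile

import spacy
from spacy.tokens import Doc


class WhitespaceTokenizer:
    """Hypothetical stand-in for a custom tokenizer: splits on whitespace."""

    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, text):
        # Build a Doc directly from the whitespace-separated words.
        return Doc(self.vocab, words=text.split())


# Assumption: a blank pipeline stands in for the trained NER model.
nlp = spacy.blank("en")

with tempfile.TemporaryDirectory() as model_dir:
    # Save with the default tokenizer still attached, so nothing custom
    # needs to be serialized.
    nlp.to_disk(model_dir)
    # spacy.load reads the saved metadata and rebuilds the pipeline,
    # rather than only loading binary data into an existing object.
    reloaded = spacy.load(model_dir)

# Re-attach the custom tokenizer after loading.
reloaded.tokenizer = WhitespaceTokenizer(reloaded.vocab)

doc = reloaded("New-York is big")
tokens = [token.text for token in doc]
print(reloaded.pipe_names, tokens)
```

With a trained model instead of the blank stand-in, reloaded.pipe_names should list the components that were saved, such as ner.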