1
votes

I want to use spaCy's Matcher class on a new language (Hebrew) for which spaCy does not yet have a working model.

I found a working tokenizer + POS tagger (from Stanford NLP), yet I would prefer spaCy as its Matcher can help me do some rule-based NER.

Can the rule-based Matcher be fed text that was tokenized and POS-tagged elsewhere, instead of the output of spaCy's standard NLP pipeline?

4 Answers

2
votes

You can set the words and tags of a spaCy Doc by hand from another source and then run the Matcher on it. Here's an example that uses English words and tags just to demonstrate:

from spacy.lang.he import Hebrew
from spacy.tokens import Doc
from spacy.matcher import Matcher

words = ["my", "words"]
tags = ["PRP$", "NNS"]

nlp = Hebrew()
doc = Doc(nlp.vocab, words=words)
for i in range(len(doc)):
    doc[i].tag_ = tags[i]

# This is normally set by the tagger. The Matcher validates that
# the Doc has been tagged when you use the `"TAG"` attribute.
doc.is_tagged = True

matcher = Matcher(nlp.vocab)
pattern = [{"TAG": "PRP$"}]
matcher.add("poss", None, pattern)
print(matcher(doc))
# [(440, 0, 1)]
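
The example above targets the spaCy 2.x API. As far as I can tell, in spaCy 3.x the `is_tagged` flag can no longer be set and `Matcher.add` takes a list of patterns, so a rough equivalent might look like the sketch below. It assumes spaCy 3.x is installed; the words, tags, and the rule name "poss" are the same illustrative values as before.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

# Words and tags would come from the external tagger; English is used
# here only for illustration, as in the example above.
words = ["my", "words"]
tags = ["PRP$", "NNS"]

nlp = spacy.blank("he")

# Assumes spaCy 3.x: fine-grained tags are passed straight to the Doc
# constructor, so no `is_tagged` flag has to be set by hand.
doc = Doc(nlp.vocab, words=words, tags=tags)

matcher = Matcher(nlp.vocab)
# Assumes spaCy 3.x: patterns are passed as a list of patterns.
matcher.add("poss", [[{"TAG": "PRP$"}]])

# Resolve the match ID hash back to the rule name for readability.
matches = [(nlp.vocab.strings[match_id], start, end)
           for match_id, start, end in matcher(doc)]
print(matches)
```

Looking up the match ID in `nlp.vocab.strings` gives back the rule name instead of a hash, which is easier to read when several rules are registered.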

1
votes

Since I am using stanfordnlp, there seems to be a package that closes the gap :-)

https://github.com/explosion/spacy-stanfordnlp

0
votes

If you want to train a new statistical model with spaCy, you should read the documentation on Training spaCy’s Statistical Models.

0
votes

As far as I know, spaCy does not have a trained model for Hebrew yet. To use a language that has no trained model, you can create a blank pipeline:

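import spacy
from spacy.lang.he import Hebrew

# Either instantiate the language class directly...
nlp = Hebrew()
# ...or, equivalently, create a blank pipeline from the language code.
nlp = spacy.blank("he")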

I'm pretty sure you can build your rule-based matcher from here.
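
As a rough sketch of that last step, the snippet below matches on token text only, so it needs no tagger at all. It assumes spaCy 3.x; the sentence, the rule name "CITY", and the label "GPE" are made up for illustration, and a real rule set would of course be larger.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Span

nlp = spacy.blank("he")  # tokenizer only, no trained components

matcher = Matcher(nlp.vocab)
# Hypothetical rule: the two-token place name "תל אביב" (Tel Aviv).
matcher.add("CITY", [[{"ORTH": "תל"}, {"ORTH": "אביב"}]])

# Illustrative sentence: "I live near Tel Aviv".
doc = nlp("אני גר ליד תל אביב")

# Turn each match into an entity span; "GPE" is an assumed label name.
doc.ents = [Span(doc, start, end, label="GPE")
            for _match_id, start, end in matcher(doc)]

entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)
```

Patterns that use "TAG" or "POS" would additionally need those annotations on the Doc, for example from an external tagger.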