1 vote

I have just begun running Python notebooks on a Spark cluster in Azure Databricks. As a requirement, we have installed a couple of external packages, such as spacy and kafka, both through shell commands and through the 'Create Library' UI in the Databricks workspace.

python -m spacy download en_core_web_sm

However, every time we run 'import ', the cluster throws a 'Module not found' error:

OSError: Can't find Model 'en_core_web_sm'

On top of that, we can find no way to determine exactly where these modules are installed. The issue persists even after adding the module path to 'sys.path'.
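(For reference, one way to check where a module is being imported from — shown here with the stdlib json module as a stand-in; the same calls apply to any package name on the cluster:)

```python
import importlib.util
import sys

# find_spec reports where the interpreter would load a package from;
# substitute "spacy" or "en_core_web_sm" for "json" on the cluster
spec = importlib.util.find_spec("json")
print("installed at:", spec.origin if spec else "not found")

# Only directories on sys.path are searched at import time
for p in sys.path:
    print(p)
```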

Please let us know how to fix this.


3 Answers

1 vote

Install the spacy "en_core_web_sm" model with:

    %sh python -m spacy download en_core_web_sm

then import the model as:

    import en_core_web_sm
    nlp = en_core_web_sm.load()
    doc = nlp("My name is Raghu Ram. I live in Kolkata.")
    for ent in doc.ents:
      print(ent.text, ent.label_)
0 votes

You can follow the below steps to install and load spaCy package on Azure Databricks.

Step 1: Install spaCy using pip and download the spaCy model.

%sh
/databricks/python3/bin/pip install spacy 
/databricks/python3/bin/python3 -m spacy download en_core_web_sm
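The full /databricks/python3/bin path matters: a bare pip inside %sh can resolve to a different interpreter than the one the notebook uses, so the package lands somewhere the notebook never looks. A quick sanity check (a generic sketch, not Databricks-specific):

```python
import sys

# The interpreter the notebook actually runs on; packages must be
# installed against this one (e.g. /databricks/python3/bin/python3)
print(sys.executable)
```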

Notebook output:


Step 2: Run an example using spaCy.

import spacy

# Load English tokenizer, tagger, parser, NER and word vectors
nlp = spacy.load("en_core_web_sm")

# Process whole documents
text = ("When Sebastian Thrun started working on self-driving cars at "
        "Google in 2007, few people outside of the company took him "
        "seriously. “I can tell you very senior CEOs of major American "
        "car companies would shake my hand and turn away because I wasn’t "
        "worth talking to,” said Thrun, in an interview with Recode earlier "
        "this week.")
doc = nlp(text)

# Analyze syntax
print("Noun phrases:", [chunk.text for chunk in doc.noun_chunks])
print("Verbs:", [token.lemma_ for token in doc if token.pos_ == "VERB"])

# Find named entities, phrases and concepts
for entity in doc.ents:
    print(entity.text, entity.label_)

Notebook output:


Hope this helps. Do let us know if you have any further queries.



0 votes

Use the Databricks ML runtime distribution when creating a cluster: https://docs.databricks.com/runtime/mlruntime.html

Then you can install spacy from the Install Library UI (go to the cluster's Libraries tab and install as usual), or via %sh, %pip, or %conda.
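For example, on an ML runtime cluster a notebook-scoped install is a single cell (a sketch; this is a Databricks notebook magic command, not plain shell):

    %pip install spacy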

Then, to load the English model:

%python

import spacy

# Download the large English model, then load it
spacy.cli.download("en_core_web_lg")
nlp = spacy.load("en_core_web_lg")