
I have deployed an HDInsight 3.5 Spark (2.0) cluster on Microsoft Azure with the standard configuration (Location = US East, Head Nodes = D12 v2 (x2), Worker Nodes = D4 v2 (x4)). When the cluster is running, I connect to a Jupyter notebook and try to import a module I wrote myself:

import own_module

This unfortunately does not work, so I tried to 1) upload own_module.py to the Jupyter notebook home directory and 2) copy own_module.py to /home/sshuser over an SSH connection. Afterwards I added /home/sshuser to sys.path and PYTHONPATH:

import os, sys

sys.path.append('/home/sshuser')
os.environ['PYTHONPATH'] = os.environ.get('PYTHONPATH', '') + ':/home/sshuser'

This also does not work, and the error persists:

Traceback (most recent call last):
ImportError: No module named own_module

Could someone tell me how I can import my own modules? Preferably by putting them in Azure Blob storage and then making them available on the HDInsight cluster.


1 Answer


You can use the Spark context's addPyFile method. First put the file into Azure Blob storage, then copy its public http/https address and pass that URL to addPyFile. The module will then be accessible on the driver and all executors.
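
A minimal sketch of that approach, assuming the module was uploaded to a publicly readable blob container (the storage account and container names below are placeholders). In the HDInsight PySpark notebook kernel sc is usually already defined, so the getOrCreate call is only a safeguard:

from pyspark import SparkContext

# In an HDInsight PySpark notebook `sc` already exists; getOrCreate is a no-op there.
sc = SparkContext.getOrCreate()

# Ship the module to the driver and every executor.
# addPyFile accepts local paths, Hadoop-compatible paths (e.g. wasb://), and http/https/ftp URLs.
sc.addPyFile("https://<storage_account>.blob.core.windows.net/<container>/own_module.py")

import own_module  # should now resolve, including inside functions shipped to executors

If the container is not public, a wasb:// path to the cluster's default storage account (which the cluster can already read) can be used instead of the https URL.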