
I packaged my Python code in .egg format using setuptools. I want to run this package through a job in Azure Databricks.
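
For context, a minimal setup.py along these lines produces the egg used below. The package name and version are taken from the egg filename; the exact project layout is an assumption.

# setup.py - minimal sketch, assuming a "hello" package containing pi.py
from setuptools import setup, find_packages

setup(
    name="hello",
    version="1.0",
    packages=find_packages(),  # picks up the hello/ package
)

Building it with Python 3.6 ("python setup.py bdist_egg") drops hello-1.0-py3.6.egg into ./dist.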

I can execute the package on my local machine with the following command:

spark-submit --py-files ./dist/hello-1.0-py3.6.egg hello/pi.py
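
The question does not show pi.py itself; for illustration, something like the standard Monte Carlo pi estimate would match the way it is invoked above.

# hello/pi.py - assumed contents; a standard Monte Carlo estimate of pi
import random
from operator import add

from pyspark.sql import SparkSession


def inside(_):
    # Sample a random point in the unit square and test whether it falls
    # inside the quarter circle of radius 1
    x, y = random.random(), random.random()
    return 1 if x * x + y * y <= 1 else 0


if __name__ == "__main__":
    spark = SparkSession.builder.appName("PiFromEgg").getOrCreate()
    n = 100000
    count = spark.sparkContext.parallelize(range(n)).map(inside).reduce(add)
    print("Pi is roughly %f" % (4.0 * count / n))
    spark.stop()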

1) Copied the package to a DBFS path as follows:

Workspace -> User -> Create -> Library -> Library Source (DBFS) -> Library Type (Python Egg) -> Uploaded

2) Created a job with a spark-submit task on a new cluster

3) Configured the following parameters for the task:

["--py-files","dbfs:/FileStore/jars/8c1231610de06d96-hello_1_0_py3_6-70b16.egg","hello/pi.py"]

Actual: /databricks/python/bin/python: can't open file '/databricks/driver/hello/hello.py': [Errno 2] No such file or directory

Expected: The job should execute successfully.

Did you install 8c1231610de06d96-hello_1_0_py3_6-70b16.egg? Did you create a new cluster? – Eric Bellet

1 Answer


The only way I've got this to work is by using the API to create a Python Job. The UI does not support this for some reason.

I use PowerShell to work with the API. Here is an example that creates a job using an egg, which works for me:

# Library spec pointing at the egg once it has been uploaded to DBFS
$Lib = '{"egg":"LOCATION"}'.Replace("LOCATION", "dbfs:$TargetDBFSFolderCode/pipelines.egg")
$ClusterId = "my-cluster-id"
$j = "sample"
$PythonParameters = "pipelines.jobs.cleansed.$j"
$MainScript = "dbfs:" + $TargetDBFSFolderCode + "/main.py"
# Upload the local build output (main.py and pipelines.egg) to DBFS
Add-DatabricksDBFSFile -BearerToken $BearerToken -Region $Region -LocalRootFolder "./bin/tmp" -FilePattern "*.*" -TargetLocation $TargetDBFSFolderCode -Verbose
# Create the job pointed at main.py, attaching the egg as a library
Add-DatabricksPythonJob -BearerToken $BearerToken -Region $Region -JobName "$j-$Environment" -ClusterId $ClusterId `
    -PythonPath $MainScript -PythonParameters $PythonParameters -Libraries $Lib -Verbose

That copies my main.py and pipelines.egg to DBFS, then creates a job pointed at them, passing in a parameter.
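
If you would rather skip the PowerShell module and call the Databricks Jobs REST API (2.0) directly, the equivalent create-job request looks roughly like this; the workspace URL, token, and DBFS paths below are placeholders, not values from my setup.

# Sketch of the same job creation done directly against the Jobs REST API 2.0
import requests

workspace = "https://<your-region>.azuredatabricks.net"  # placeholder
token = "<personal-access-token>"                        # placeholder

job_spec = {
    "name": "sample-dev",
    "existing_cluster_id": "my-cluster-id",
    "libraries": [{"egg": "dbfs:/pipelines/pipelines.egg"}],
    "spark_python_task": {
        "python_file": "dbfs:/pipelines/main.py",
        "parameters": ["pipelines.jobs.cleansed.sample"],
    },
}

resp = requests.post(
    workspace + "/api/2.0/jobs/create",
    headers={"Authorization": "Bearer " + token},
    json=job_spec,
)
resp.raise_for_status()
print(resp.json())  # returns the new job_id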

One annoying thing about eggs on Databricks: you must uninstall the library and restart the cluster before it picks up any new version that you deploy.

If you use an engineering cluster, this is not an issue.