
I packaged my Python code in .egg format using setuptools. I want to run this package through a job in Azure Databricks.
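
For context, a minimal setup.py along these lines produces the egg used below. The package name and version are taken from the egg filename; the exact project layout is an assumption.

# setup.py - minimal sketch, assuming a "hello" package containing pi.py
from setuptools import setup, find_packages

setup(
    name="hello",
    version="1.0",
    packages=find_packages(),  # picks up the hello/ package
)

Building it with Python 3.6 ("python setup.py bdist_egg") drops hello-1.0-py3.6.egg into ./dist.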

I can execute the package on my local machine with the following command:

spark-submit --py-files ./dist/hello-1.0-py3.6.egg hello/pi.py
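
The question does not show pi.py itself; for illustration, something like the standard Monte Carlo pi estimate would match the way it is invoked above.

# hello/pi.py - assumed contents; a standard Monte Carlo estimate of pi
import random
from operator import add

from pyspark.sql import SparkSession


def inside(_):
    # Sample a random point in the unit square and test whether it falls
    # inside the quarter circle of radius 1
    x, y = random.random(), random.random()
    return 1 if x * x + y * y <= 1 else 0


if __name__ == "__main__":
    spark = SparkSession.builder.appName("PiFromEgg").getOrCreate()
    n = 100000
    count = spark.sparkContext.parallelize(range(n)).map(inside).reduce(add)
    print("Pi is roughly %f" % (4.0 * count / n))
    spark.stop()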

1) Copied the package to a DBFS path as follows:

Workspace -> User -> Create -> Library -> Library Source (DBFS) -> Library Type (Python Egg) -> Uploaded

2) Created a job with a spark-submit task on a new cluster

3) Configured the following parameters for the task:

["--py-files","dbfs:/FileStore/jars/8c1231610de06d96-hello_1_0_py3_6-70b16.egg","hello/pi.py"]

Actual: /databricks/python/bin/python: can't open file '/databricks/driver/hello/hello.py': [Errno 2] No such file or directory

Expected: The job should execute successfully.

Did you install 8c1231610de06d96-hello_1_0_py3_6-70b16.egg? Did you create a new cluster? – Eric Bellet

1 Answer


The only way I've got this to work is by using the API to create a Python Job. The UI does not support this for some reason.

I use PowerShell to work with the API. Here is an example that creates a job using an egg, which works for me:

# Library spec pointing at the egg once it has been uploaded to DBFS
$Lib = '{"egg":"LOCATION"}'.Replace("LOCATION", "dbfs:$TargetDBFSFolderCode/pipelines.egg")
$ClusterId = "my-cluster-id"
$j = "sample"
$PythonParameters = "pipelines.jobs.cleansed.$j"
$MainScript = "dbfs:" + $TargetDBFSFolderCode + "/main.py"
# Upload the local build output (main.py and pipelines.egg) to DBFS
Add-DatabricksDBFSFile -BearerToken $BearerToken -Region $Region -LocalRootFolder "./bin/tmp" -FilePattern "*.*" -TargetLocation $TargetDBFSFolderCode -Verbose
# Create the job pointed at main.py, attaching the egg as a library
Add-DatabricksPythonJob -BearerToken $BearerToken -Region $Region -JobName "$j-$Environment" -ClusterId $ClusterId `
    -PythonPath $MainScript -PythonParameters $PythonParameters -Libraries $Lib -Verbose

That copies my main.py and pipelines.egg to DBFS, then creates a job pointed at them, passing in a parameter.
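
If you would rather skip the PowerShell module and call the Databricks Jobs REST API (2.0) directly, the equivalent create-job request looks roughly like this; the workspace URL, token, and DBFS paths below are placeholders, not values from my setup.

# Sketch of the same job creation done directly against the Jobs REST API 2.0
import requests

workspace = "https://<your-region>.azuredatabricks.net"  # placeholder
token = "<personal-access-token>"                        # placeholder

job_spec = {
    "name": "sample-dev",
    "existing_cluster_id": "my-cluster-id",
    "libraries": [{"egg": "dbfs:/pipelines/pipelines.egg"}],
    "spark_python_task": {
        "python_file": "dbfs:/pipelines/main.py",
        "parameters": ["pipelines.jobs.cleansed.sample"],
    },
}

resp = requests.post(
    workspace + "/api/2.0/jobs/create",
    headers={"Authorization": "Bearer " + token},
    json=job_spec,
)
resp.raise_for_status()
print(resp.json())  # returns the new job_id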

One annoying thing about eggs on Databricks: you must uninstall the library and restart the cluster before it picks up any new version that you deploy.

If you use an engineering cluster, this is not an issue.