
I want to process large tables stored in Azure Data Lake Storage (Gen1), first running a U-SQL script on them, then a Python script, and finally outputting the result.

Conceptually this is pretty simple:

  1. Run a .usql script to generate intermediate data (two tables, intermediate_1 and intermediate_2) from a large initial_table
  2. Run a Python script over the intermediate data to generate the final result, final

What should be the Azure Machine Learning Pipeline steps to do this?

I thought the following plan would work:

  1. Run the .usql script on an adla_compute target using an AdlaStep, like

    from azureml.pipeline.core import PipelineData
    from azureml.pipeline.steps import AdlaStep

    # intermediate outputs written to the ADLS datastore
    int_1 = PipelineData("intermediate_1", datastore=adls_datastore)
    int_2 = PipelineData("intermediate_2", datastore=adls_datastore)

    adla_step = AdlaStep(script_name='script.usql',
                         source_directory=sample_folder,
                         inputs=[initial_table],
                         outputs=[int_1, int_2],
                         compute_target=adla_compute)
    
  2. Run a PythonScriptStep on an aml_compute compute target, like

    from azureml.pipeline.steps import PythonScriptStep

    python_step = PythonScriptStep(script_name="process.py",
                                   arguments=["--input1", int_1, "--input2", int_2, "--output", final],
                                   inputs=[int_1, int_2],
                                   outputs=[final],
                                   compute_target=aml_compute,
                                   source_directory=source_directory)
    

This, however, fails at the Python step with an error of the kind:

StepRun(process.py) Execution Summary

======================================
StepRun(process.py) Status: Failed

Unable to mount data store mydatastore because it does not specify a storage account key.

I don't really understand the error complaining about 'mydatastore', which is the name of the adls_datastore Azure Data Lake datastore against which I am running the U-SQL queries.

Can anyone spot what I am doing wrong here? Should I move the intermediate data (intermediate_1 and intermediate_2) to a storage account, e.g. with a DataTransferStep, before the PythonScriptStep?


2 Answers

1 vote

ADLS does not support mounting. So you are right: you will have to use a DataTransferStep to move the data to blob storage first.
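
As a rough sketch of the prerequisites for that (assuming a workspace object ws; the datastore name, container, storage account details, and compute name below are placeholders), you would register a blob datastore and create an Azure Data Factory compute for the DataTransferStep to run on:

from azureml.core import Datastore
from azureml.core.compute import ComputeTarget, DataFactoryCompute

# Blob datastore the intermediate data will be copied into
blob_datastore = Datastore.register_azure_blob_container(
    workspace=ws,
    datastore_name="blob_datastore",
    container_name="intermediate",
    account_name="<storage-account-name>",
    account_key="<storage-account-key>")

# Azure Data Factory compute used by the DataTransferStep
adf_config = DataFactoryCompute.provisioning_configuration()
data_factory_compute = ComputeTarget.create(ws, "adf-compute", adf_config)
data_factory_compute.wait_for_completion(show_output=True)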

1 vote

Azure Data Lake Store is not supported as a datastore for AML compute. This table lists the different compute targets and their level of support for each datastore type: https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-access-data#compute-and-datastore-matrix

You can use a DataTransferStep to copy the data from ADLS to blob storage and then use that blob data as input to the PythonScriptStep. Sample notebook: https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/machine-learning-pipelines/intro-to-pipelines/aml-pipelines-data-transfer.ipynb


from azureml.data.data_reference import DataReference
from azureml.pipeline.steps import DataTransferStep, PythonScriptStep

# register blob datastore, example in linked notebook
# blob_datastore = Datastore.register_azure_blob_container(...

int_1_blob = DataReference(
    datastore=blob_datastore,
    data_reference_name="int_1_blob",
    path_on_datastore="int_1")

copy_int_1_to_blob = DataTransferStep(
    name='copy int_1 to blob',
    source_data_reference=int_1,
    destination_data_reference=int_1_blob,
    compute_target=data_factory_compute)

int_2_blob = DataReference(
    datastore=blob_datastore,
    data_reference_name="int_2_blob",
    path_on_datastore="int_2")

copy_int_2_to_blob = DataTransferStep(
    name='copy int_2 to blob',
    source_data_reference=int_2,
    destination_data_reference=int_2_blob,
    compute_target=data_factory_compute)

# update the PythonScriptStep to take the blob data references as inputs
python_step = PythonScriptStep(script_name="process.py",
                               arguments=["--input1", int_1_blob, "--input2", int_2_blob, "--output", final],
                               inputs=[int_1_blob, int_2_blob],
                               outputs=[final],
                               compute_target=aml_compute,
                               source_directory=source_directory)