I want to process large tables stored in Azure Data Lake Storage (Gen 1): first run a U-SQL script on them, then a Python script, and finally output the result.
Conceptually this is pretty simple:
- Run a `.usql` script to generate intermediate data (two tables, `intermediate_1` and `intermediate_2`) from a large `initial_table`
- Run a Python script over the intermediate data to generate the final result `final`
What should the Azure Machine Learning Pipeline steps be to do this?
I thought the following plan would work:
Run the `.usql` query on an `adla_compute` using an `AdlaStep`, like:

```python
intermediate_1 = PipelineData("intermediate_1", datastore=adls_datastore)
intermediate_2 = PipelineData("intermediate_2", datastore=adls_datastore)

adla_step = AdlaStep(
    script_name='script.usql',
    source_directory=sample_folder,
    inputs=[initial_table],
    outputs=[intermediate_1, intermediate_2],
    compute_target=adla_compute)
```
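For context, `adls_datastore` and `initial_table` are set up roughly like this (a sketch; the `workspace` object and the path on the datastore are placeholders):

```python
from azureml.core import Datastore
from azureml.data.data_reference import DataReference

# ADLS Gen1 datastore previously registered in the workspace
adls_datastore = Datastore.get(workspace, "mydatastore")

# Reference to the large input table sitting on the data lake
initial_table = DataReference(
    datastore=adls_datastore,
    data_reference_name="initial_table",
    path_on_datastore="path/to/initial_table")  # placeholder path
```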
Run a Python step on a compute target `aml_compute`, like:

```python
python_step = PythonScriptStep(
    script_name="process.py",
    arguments=["--input1", intermediate_1, "--input2", intermediate_2, "--output", final],
    inputs=[intermediate_1, intermediate_2],
    outputs=[final],
    compute_target=aml_compute,
    source_directory=source_directory)
```
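For reference, `process.py` just parses the mounted paths from its arguments, roughly (a sketch; the actual processing logic is omitted):

```python
# process.py -- argument handling sketch (processing logic omitted)
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--input1")
parser.add_argument("--input2")
parser.add_argument("--output")
args = parser.parse_args()

# args.input1 / args.input2 should resolve to the mounted intermediate data,
# and args.output to the location where the final result is written
```

The two steps are then assembled with `Pipeline(workspace=workspace, steps=[adla_step, python_step])` and submitted.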
This, however, fails at the Python step with an error of the kind:
```
StepRun(process.py) Execution Summary
======================================
StepRun(process.py) Status: Failed

Unable to mount data store mydatastore because it does not specify a storage account key.
```
I don't really understand the error complaining about 'mydatastore', which is the name tied to the `adls_datastore` Azure Data Lake datastore reference against which I am running the U-SQL queries.
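For reference, `mydatastore` was registered as an ADLS Gen1 datastore with a service principal, roughly like this (a sketch; the store name and credential variables are placeholders):

```python
from azureml.core import Datastore

adls_datastore = Datastore.register_azure_data_lake(
    workspace=workspace,
    datastore_name="mydatastore",
    store_name="myadlsaccount",   # placeholder ADLS Gen1 store name
    tenant_id=tenant_id,          # service principal credentials
    client_id=client_id,
    client_secret=client_secret)
```

Note that ADLS Gen1 datastores authenticate with a service principal rather than a storage account key, which seems related to the error.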
Can someone tell whether I am doing something really wrong here?
Should I move the intermediate data (`intermediate_1` and `intermediate_2`) to a storage account, e.g. with a `DataTransferStep`, before the `PythonScriptStep`?
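If so, I imagine the transfer would look something like this (a sketch; `blob_datastore` and `data_factory_compute` are assumed to already exist):

```python
from azureml.pipeline.core import PipelineData
from azureml.pipeline.steps import DataTransferStep

# Blob-backed copy of the intermediate data (assumes a registered blob datastore)
intermediate_1_blob = PipelineData("intermediate_1_blob", datastore=blob_datastore)

transfer_step = DataTransferStep(
    name="adls_to_blob",
    source_data_reference=intermediate_1,
    destination_data_reference=intermediate_1_blob,
    compute_target=data_factory_compute)  # DataTransferStep runs on an Azure Data Factory compute
```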