
I want to process large tables stored in Azure Data Lake Storage (Gen1), first running a U-SQL script on them, then a Python script, and finally outputting the result.

Conceptually this is pretty simple:

  1. Run a .usql script to generate intermediate data (two tables, intermediate_1 and intermediate_2) from a large initial_table
  2. Run a Python script over the intermediate data to generate the final result, final

What should be the Azure Machine Learning Pipeline steps to do this?

I thought the following plan would work:

  1. Run the .usql script on an adla_compute target using an AdlaStep, like

    from azureml.pipeline.core import PipelineData
    from azureml.pipeline.steps import AdlaStep

    # intermediate outputs written to the ADLS datastore
    int_1 = PipelineData("intermediate_1", datastore=adls_datastore)
    int_2 = PipelineData("intermediate_2", datastore=adls_datastore)

    adla_step = AdlaStep(script_name='script.usql',
                         source_directory=sample_folder,
                         inputs=[initial_table],
                         outputs=[int_1, int_2],
                         compute_target=adla_compute)
    
  2. Run a PythonScriptStep on an aml_compute compute target, like

    from azureml.pipeline.steps import PythonScriptStep

    python_step = PythonScriptStep(script_name="process.py",
                                   arguments=["--input1", int_1, "--input2", int_2, "--output", final],
                                   inputs=[int_1, int_2],
                                   outputs=[final],
                                   compute_target=aml_compute,
                                   source_directory=source_directory)
    

This, however, fails at the Python step with an error of the kind:

StepRun(process.py) Execution Summary

======================================
StepRun(process.py) Status: Failed

Unable to mount data store mydatastore because it does not specify a storage account key.

I don't really understand the error complaining about 'mydatastore', which is the name of the adls_datastore Azure Data Lake datastore against which I am running the U-SQL queries.

Can anyone spot what I am doing wrong here? Should I move the intermediate data (intermediate_1 and intermediate_2) to a storage account, e.g. with a DataTransferStep, before the PythonScriptStep?


2 Answers

1 vote

ADLS does not support mounting. So you are right: you will have to use a DataTransferStep to move the data to blob storage first.
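
As a rough sketch of the prerequisites for that (assuming a workspace object ws; the datastore name, container, storage account details, and compute name below are placeholders), you would register a blob datastore and create an Azure Data Factory compute for the DataTransferStep to run on:

from azureml.core import Datastore
from azureml.core.compute import ComputeTarget, DataFactoryCompute

# Blob datastore the intermediate data will be copied into
blob_datastore = Datastore.register_azure_blob_container(
    workspace=ws,
    datastore_name="blob_datastore",
    container_name="intermediate",
    account_name="<storage-account-name>",
    account_key="<storage-account-key>")

# Azure Data Factory compute used by the DataTransferStep
adf_config = DataFactoryCompute.provisioning_configuration()
data_factory_compute = ComputeTarget.create(ws, "adf-compute", adf_config)
data_factory_compute.wait_for_completion(show_output=True)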

1 vote

Azure Data Lake Store is not supported as a datastore for AML compute. This table lists the different compute targets and their level of support for each datastore type: https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-access-data#compute-and-datastore-matrix

You can use a DataTransferStep to copy the data from ADLS to blob storage and then use that blob data as input to the PythonScriptStep. Sample notebook: https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/machine-learning-pipelines/intro-to-pipelines/aml-pipelines-data-transfer.ipynb


from azureml.data.data_reference import DataReference
from azureml.pipeline.steps import DataTransferStep, PythonScriptStep

# register blob datastore, example in linked notebook
# blob_datastore = Datastore.register_azure_blob_container(...

int_1_blob = DataReference(
    datastore=blob_datastore,
    data_reference_name="int_1_blob",
    path_on_datastore="int_1")

copy_int_1_to_blob = DataTransferStep(
    name='copy int_1 to blob',
    source_data_reference=int_1,
    destination_data_reference=int_1_blob,
    compute_target=data_factory_compute)

int_2_blob = DataReference(
    datastore=blob_datastore,
    data_reference_name="int_2_blob",
    path_on_datastore="int_2")

copy_int_2_to_blob = DataTransferStep(
    name='copy int_2 to blob',
    source_data_reference=int_2,
    destination_data_reference=int_2_blob,
    compute_target=data_factory_compute)

# update the PythonScriptStep to take the blob data references as inputs
python_step = PythonScriptStep(script_name="process.py",
                               arguments=["--input1", int_1_blob, "--input2", int_2_blob, "--output", final],
                               inputs=[int_1_blob, int_2_blob],
                               outputs=[final],
                               compute_target=aml_compute,
                               source_directory=source_directory)