0 votes

How can I read data from multiple files stored in Azure Blob storage with Azure Machine Learning Studio "at once"?

I tried the Reader module and it works just fine with one file. Can it be used for more than one, or do I have to look for another solution?

Thank you for your help!


3 Answers

1 vote

If there are not too many blobs, you can simply add multiple Reader modules, each mapped to one of your input blobs, and then combine them using modules under "Data Transformation" -> "Manipulation", such as "Add Rows" or "Join".

0 votes

Use multiple Reader modules, each reading from a different blob, and then connect them to a Metadata Editor.

0 votes

Although the approach of using multiple Reader modules will work, it becomes unwieldy when there are many inputs or when the number of inputs varies.

Instead, you can use the Execute Python Script module to access blob storage directly. Doing so, however, is painful if you've never done it before. Here are the issues:

  1. The azure.storage.blob Python package is not loaded into Azure ML by default. However, it can be packaged manually, or downloaded from the link below (correct version as of Feb 11, 2016).
  2. The default usage of azure.storage.blob.BlobService is HTTPS, which is not currently supported for Azure ML blob storage access. To work around this, pass protocol='http' when creating the BlobService to force HTTP: client = BlobService(STORAGE_ACCOUNT, STORAGE_KEY, protocol='http')

Here are the steps to get it working:

  1. Download azure.zip, which provides the required azure.storage.* libraries: https://azuremlpackagesupport.blob.core.windows.net/python/azure.zip
  2. Upload it as a DataSet to Azure ML Studio.
  3. Connect it to the Zip input of an Execute Python Script module (the third input).
  4. Write your script as you normally would, being sure to create your BlobService object with protocol='http'.
  5. Run the experiment - you should now be able to read from and write to blob storage.

Some example code can be found here: https://gist.github.com/drdarshan/92fff2a12ad9946892df

Here is the code that makes it work for a single file. It can be extended to handle numerous files by accessing a container and filtering (see the sketch after this block), but how you filter will depend on your business logic.

from azure.storage.blob import BlobService

def azureml_main(dataframe1 = None, dataframe2 = None):
    account_name = 'mystorageaccount'
    account_key='p8kSy3FACx...redacted...ebz3plQ=='
    container_name = "upload"

    blob_service = BlobService(account_name, account_key, protocol='http')

    # Read the blob's contents as text. You can also use
    # get_blob_to_(bytes|file|path) if you need to.
    blob_text = blob_service.get_blob_to_text(container_name, 'myfile.txt')

    # Do stuff with your file here
    #   Logic, logic, logic

    # Execute Python Script requires that the return value be a sequence
    # of pandas.DataFrame objects; the frames themselves can be None.
    return dataframe1,
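
As a rough sketch of that extension (hedged: the '.csv' filter, the pandas concatenation, and the reuse of the same container are illustrative assumptions, not part of the original answer), you can list the container's blobs with list_blobs, read each matching one, and stack the results into a single DataFrame - roughly the code-level analogue of the "Add Rows" module:

from StringIO import StringIO  # Execute Python Script runs Python 2

import pandas as pd
from azure.storage.blob import BlobService

def azureml_main(dataframe1 = None, dataframe2 = None):
    account_name = 'mystorageaccount'
    account_key = 'p8kSy3FACx...redacted...ebz3plQ=='
    container_name = "upload"

    blob_service = BlobService(account_name, account_key, protocol='http')

    # List every blob in the container and keep the ones you care about.
    # The '.csv' suffix check stands in for whatever filtering your
    # business logic requires.
    frames = []
    for blob in blob_service.list_blobs(container_name):
        if blob.name.endswith('.csv'):
            text = blob_service.get_blob_to_text(container_name, blob.name)
            frames.append(pd.read_csv(StringIO(text)))

    # Stack all the files into one DataFrame and return it as the
    # required sequence of pandas.DataFrame objects.
    return pd.concat(frames, ignore_index=True),

For large files, get_blob_to_path (writing to a local file) may be friendlier than pulling everything into memory as text.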

For further information on limitations, why HTTP, and other notes, see Access Azure blob storage from within an Azure ML experiment.