1
votes

I want to utilize an AZURE Function app to read an XLSX File from an AZURE BLOB Storage. The Function shall be called by a REST API call. I can access the blob and download the file but I'm struggling with reading the content of the file with pandas directly. I'm searching and trying for hours but can't find a solution. My latest approach looks like this:

def main(req: func.HttpRequest) -> func.HttpResponse:
        logging.info('Python HTTP trigger function processed a request.')

        blob_service_client = BlobServiceClient.from_connection_string(CONNECTION_STRING)
        container_client = blob_service_client.get_container_client(CONTAINERNAME)
        blob_client = blob_service_client.get_blob_client(container = CONTAINERNAME, blob=BLOBNAME)
        blob = BlobClient(ACCOUNT_URL, CONTAINERNAME, BLOBNAME)

        #READ PRODUCTS FILE
        blob_client.download_blob().readinto(LOCALFILENAME)
        df = pd.read_excel(blob_client.download_blob())

On the MS homepage*, there is one example to download a file from a blob and process it afterwards but as I'm utilizing a function app it doesn't make sense to download the file first if I'm not missing anything...

*https://docs.microsoft.com/en-us/azure/storage/blobs/storage-quickstart-blobs-python

The autocomplete for blob gives me the following options but they are also not really helpful: Autocomplete suggestions for blob

The error messages are different depending on the way I try to read the file. The current one is:

System.Private.CoreLib: Exception while executing function: Functions.TrainProductModel. System.Private.CoreLib: Result: Failure
Exception: AttributeError: 'str' object has no attribute 'write'

But I think there is actually something fundamentally wrong with my approach. The desired result is to read the file directly into a pandas table.

Appreciating any support as this blocking progress with my master thesis :/

1
Have you tried df = pd.read_excel(blob_client.download_blob().readinto())? I'm not familiar with pandas, nor the blob lib for python, but reading the docs for read_excel and download_blob, it seems you have to use readinto() to download the blob into a stream, which read_excel() seems to be accepting as a valid argument for the io parameter.eli
you need to give 'r' and filename, and also, you need to install old version of xlrd. Please check my answer.Bowman Zhu

1 Answers

1
votes

Pandas itself does not have the ability to parse xlsx files. Pandas parses xlsx files based on the external library xlrd. And you shouldn't install the high version of xlrd, because the high version cancels the support for xlsx files (only xls files are supported), and the recommended version is 1.2.0 (this is valid for me).

Below is my code:

import os, uuid
from azure.storage.blob import BlobServiceClient, BlobClient, ContainerClient, __version__
import pandas as pd

CONNECTION_STRING = "DefaultEndpointsProtocol=https;AccountName=0730bowmanwindow;AccountKey=xxxxxx;EndpointSuffix=core.windows.net"
CONTAINERNAME = "test"
BLOBNAME = "test.xlsx"
LOCALFILENAME = "testx.xlsx" 

blob_service_client = BlobServiceClient.from_connection_string(CONNECTION_STRING)
container_client = blob_service_client.get_container_client(CONTAINERNAME)
blob_client = blob_service_client.get_blob_client(container = CONTAINERNAME, blob=BLOBNAME)

#READ PRODUCTS FILE
f = open(LOCALFILENAME, "wb")
f.write(blob_client.download_blob().content_as_bytes())
f.close()
df = pd.read_excel(r''+LOCALFILENAME)
print(df)

And it works on my side:

enter image description here