I am trying to read an xlsx file from Azure Blob Storage into a pandas DataFrame without creating a temporary local file. I have seen many similar questions, e.g. Issues Reading Azure Blob CSV Into Python Pandas DF, but haven't managed to get the proposed solutions to work.

The code snippet below raises UnicodeDecodeError: 'utf-8' codec can't decode byte 0x87 in position 14: invalid start byte.

from io import StringIO
import pandas as pd
from azure.storage.blob import BlobClient, BlobServiceClient

blob_client = BlobClient.from_blob_url(blob_url = url + container + "/" + blobname, credential = token)   
blob = blob_client.download_blob().content_as_text()   
df = pd.read_excel(StringIO(blob))

Using a temporary file, I do manage to make it work with the following code snippet:

blob_service_client = BlobServiceClient(account_url = url, credential = token)
blob_client = blob_service_client.get_blob_client(container=container, blob=blobname)

with open(tmpfile, "wb") as my_blob:
    download_stream = blob_client.download_blob()
    my_blob.write(download_stream.readall())

data = pd.read_excel(tmpfile)

1 Answer

Similar to what you have already done, we can call download_blob() to get a StorageStreamDownloader object, then content_as_text() to download the contents and decode them to a string.

For a CSV file, we can then read the contents from a StringIO buffer into a pandas DataFrame with pandas.read_csv().

from io import StringIO
import pandas as pd
from azure.storage.blob import BlobClient, BlobServiceClient
import os

connection_string = os.getenv('AZURE_STORAGE_CONNECTION_STRING')

blob_service_client = BlobServiceClient.from_connection_string(connection_string)

blob_client = blob_service_client.get_blob_client(container="blobs", blob="test.csv")

blob = blob_client.download_blob().content_as_text()

df = pd.read_csv(StringIO(blob))

Update

If we are working with XLSX files, use content_as_bytes() to return bytes instead of a string, wrap them in a BytesIO buffer, and convert to a pandas DataFrame with pandas.read_excel():

from io import BytesIO
import os

import pandas as pd
from azure.storage.blob import BlobServiceClient

connection_string = os.getenv('AZURE_STORAGE_CONNECTION_STRING')

blob_service_client = BlobServiceClient.from_connection_string(connection_string)

blob_client = blob_service_client.get_blob_client(container="blobs", blob="test.xlsx")

# Download the raw bytes; an XLSX file is binary, so it must not be decoded as text
blob = blob_client.download_blob().content_as_bytes()

# Wrap the bytes in a binary buffer so read_excel() can treat them like a file
df = pd.read_excel(BytesIO(blob))

Since content_as_text() decodes with UTF-8 by default, and an XLSX file is a binary ZIP archive rather than text, this is probably what causes the UnicodeDecodeError.
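
To illustrate the point about decoding, here is a minimal standard-library sketch. It does not touch Azure or pandas: it builds a small ZIP archive in memory as a stand-in for an XLSX file (the member name payload.bin and its contents are made up for the demonstration), shows that decoding the raw bytes as UTF-8 fails, and shows that the same bytes remain readable when wrapped in BytesIO.

```python
import io
import zipfile

# Build a tiny ZIP archive in memory as a stand-in for an .xlsx file
# (an .xlsx workbook is a ZIP container of XML parts).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_STORED) as archive:
    # Hypothetical member; the 0x87 byte mirrors the one in the error message.
    archive.writestr("payload.bin", b"\x87\x00binary data")
raw = buffer.getvalue()

# ZIP local file headers begin with the signature bytes "PK".
starts_like_zip = raw[:2] == b"PK"

# Treating the archive as UTF-8 text fails, as content_as_text() did.
try:
    raw.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError as exc:
    decode_failed = True
    print(f"decoding failed: {exc.reason}")

# Wrapping the untouched bytes in BytesIO keeps the archive readable.
with zipfile.ZipFile(io.BytesIO(raw)) as archive:
    member_names = archive.namelist()
```

The same reasoning applies to a real workbook: keep the download as bytes and hand pandas a binary buffer rather than a decoded string.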

We could still use content_as_text() with pandas.read_excel() if we set the encoding to None, so that the downloaded bytes are not decoded:

blob = blob_client.download_blob().content_as_text(encoding=None)

df = pd.read_excel(blob)
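
If this pattern is needed in several places, the download step could be kept in a small helper. The sketch below is only one possible shape: blob_to_buffer is a hypothetical name, and it assumes the object passed in behaves like the BlobClient used above, i.e. download_blob() returns a downloader whose readall() yields the blob's bytes (the same call the question uses for the temporary-file version).

```python
from io import BytesIO


def blob_to_buffer(blob_client) -> BytesIO:
    """Download a blob fully into memory and return it as a binary buffer.

    Assumption: blob_client behaves like azure.storage.blob.BlobClient, so
    download_blob() returns an object whose readall() yields the raw bytes.
    """
    # readall() returns bytes without any text decoding
    data = blob_client.download_blob().readall()
    # BytesIO gives a seekable, file-like view over those bytes
    return BytesIO(data)
```

With the blob_client from the answer, df = pd.read_excel(blob_to_buffer(blob_client)) should then load the workbook without a temporary file. Note that the whole blob is held in memory, which may matter for very large workbooks.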