1
votes

Is it possible to get the schema of a Parquet file from an Azure Function in Python without downloading the file from the data lake? I am using BlobStorageClient to connect to the data lake and list the files and containers, but I have no idea how to issue the command using, for example, pyarrow.

About pyarrow: https://arrow.apache.org/docs/python/parquet.html

BlobStorageClient: https://docs.microsoft.com/en-us/azure/storage/blobs/storage-quickstart-blobs-python-legacy

2
Do you mind if a stream is used to implement it? - Jim Xu

2 Answers

2
votes

Regarding the issue, please refer to the following script. It streams the blob into an in-memory buffer instead of saving it to disk:

import io

import pyarrow.parquet as pq
from azure.storage.blob import BlobServiceClient

# conn_str is assumed to hold the storage account connection string
blob_service_client = BlobServiceClient.from_connection_string(conn_str)
container_client = blob_service_client.get_container_client('test')
blob_client = container_client.get_blob_client('test.parquet')

# Stream the blob into memory; nothing is written to disk
with io.BytesIO() as f:
    download_stream = blob_client.download_blob()
    download_stream.readinto(f)
    f.seek(0)  # rewind before handing the buffer to pyarrow
    schema = pq.read_schema(f)
    print(schema)
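
Note that streaming into a BytesIO still transfers the whole blob. The Parquet schema is stored in the file footer (the file ends with the metadata, a 4-byte little-endian footer length, and the magic bytes PAR1), so in principle only the tail of the blob needs to be fetched. Below is a minimal sketch of a seekable, read-only file object that fetches byte ranges on demand. The names RangeReader and fetch_range are hypothetical and not part of pyarrow or the Azure SDK, and the Azure wiring in the trailing comments is an untested assumption.

```python
import io


class RangeReader(io.RawIOBase):
    """Read-only, seekable file object backed by a range-fetching callable.

    Hypothetical helper: fetch_range(offset, length) must return the bytes
    stored at [offset, offset + length) of the remote object.
    """

    def __init__(self, size, fetch_range):
        super().__init__()
        self._size = size
        self._fetch = fetch_range
        self._pos = 0

    def readable(self):
        return True

    def seekable(self):
        return True

    def tell(self):
        return self._pos

    def seek(self, offset, whence=io.SEEK_SET):
        if whence == io.SEEK_SET:
            new_pos = offset
        elif whence == io.SEEK_CUR:
            new_pos = self._pos + offset
        elif whence == io.SEEK_END:
            new_pos = self._size + offset
        else:
            raise ValueError("invalid whence: %r" % (whence,))
        if new_pos < 0:
            raise ValueError("negative seek position")
        self._pos = new_pos
        return self._pos

    def readinto(self, buffer):
        # Never request bytes past the end of the remote object.
        length = min(len(buffer), self._size - self._pos)
        if length <= 0:
            return 0
        data = self._fetch(self._pos, length)
        buffer[:len(data)] = data
        self._pos += len(data)
        return len(data)


# Possible wiring with azure-storage-blob v12 (assumption, not tested here):
#   size = blob_client.get_blob_properties().size
#   fetch = lambda offset, length: blob_client.download_blob(
#       offset=offset, length=length).readall()
#   schema = pq.read_schema(RangeReader(size, fetch))
```

Each read becomes one ranged request, so wrapping the reader in io.BufferedReader may help if the consumer issues many small reads.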

0
votes

It is possible to read both the Parquet schema and the Parquet metadata without reading the file's data, using read_schema and read_metadata (shown here with a local file path):

import pyarrow.parquet as pq

fname = 'filename.parquet'
meta = pq.read_metadata(fname)
schema = pq.read_schema(fname)