0
votes

I have been manually partitioning files with pandas and writing them to Azure Blob (creating an index or multi-index and then writing a separate parquet file for each index value in a loop), roughly like the sketch below.
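A minimal sketch of that manual loop (the DataFrame, column names, and local path are just placeholders; in practice each file gets uploaded to Azure Blob):

import pandas as pd
from pathlib import Path

df = pd.DataFrame({
    "year": [2007, 2007, 2008],
    "month": [1, 2, 1],
    "value": [1.0, 2.0, 3.0],
})

# one parquet file per (year, month) combination
for (year, month), group in df.groupby(["year", "month"]):
    out_dir = Path(f"dataset_name/year={year}/month={month:02d}")
    out_dir.mkdir(parents=True, exist_ok=True)
    group.to_parquet(out_dir / "0.parq")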

However, when reading the docs for pyarrow, I see that it is possible to create a 'dataset' which includes a folder structure for partitioned data. https://arrow.apache.org/docs/python/parquet.html

The example for the Monthly / daily folder is exactly what I am trying to achieve.

dataset_name/
  year=2007/
    month=01/
       0.parq
       1.parq
       ...
    month=02/
       0.parq
       1.parq
       ...
    month=03/
    ...
  year=2008/
    month=01/
    ...



import pyarrow as pa
import pyarrow.parquet as pq
# write `table` to a dataset partitioned by the 'one' and 'two' columns
fs = pa.hdfs.connect(host, port, user=user, kerb_ticket=ticket_cache_path)
pq.write_to_dataset(table, root_path='dataset_name',
                    partition_cols=['one', 'two'], filesystem=fs)

Can I do this with Azure Blob (or Minio, which uses the S3 API and wraps over my Azure Blob storage)? My ultimate goal is to read only the files that are relevant for my 'query'.

1
What's your environment for running your Python script with pyarrow? Such as Linux on an Azure VM, or local? – Peter Pan
Linux on an Azure VM (I installed all the software myself though - everything is mostly docker containers at this point, using some popular images). Python is current, from the Jupyterhub image. – ldacey

1 Answer

2
votes

Just per my experience, and based on your current environment (Linux on an Azure VM), I think there are two solutions for reading partitioned parquet files from Azure Storage.

  1. Follow the section Reading a Parquet File from Azure Blob storage of the pyarrow document Reading and Writing the Apache Parquet Format: manually list the blob names under a prefix like dataset_name via the list_blob_names(container_name, prefix=None, num_results=None, include=None, delimiter=None, marker=None, timeout=None) API of the Azure Storage SDK for Python, then read these blobs one by one into dataframes as in the sample code, and finally concat those dataframes into a single one (see the first sketch after this list).


  2. Try Azure/azure-storage-fuse to mount a container of Azure Blob Storage onto your Linux filesystem; then you just need to follow the document section Reading from Partitioned Datasets to read the partitioned dataset locally from Azure Blob Storage (see the second sketch below).
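Here is a minimal sketch of option 1, assuming the legacy azure-storage SDK (BlockBlobService) and placeholder account, key, and container names:

from io import BytesIO

import pandas as pd
import pyarrow.parquet as pq
from azure.storage.blob import BlockBlobService

# placeholder credentials and names - replace with your own
blob_service = BlockBlobService(account_name='myaccount', account_key='mykey')
container = 'mycontainer'

# list every blob under the dataset prefix, e.g. dataset_name/year=2007/month=01/0.parq
blob_names = blob_service.list_blob_names(container, prefix='dataset_name/')

# read each parquet blob into a dataframe, then concatenate them all
frames = []
for name in blob_names:
    blob = blob_service.get_blob_to_bytes(container, name)
    frames.append(pq.read_table(BytesIO(blob.content)).to_pandas())

df = pd.concat(frames, ignore_index=True)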
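And a sketch of option 2, assuming the container has already been mounted with blobfuse at a hypothetical path like /mnt/blob, using the filters keyword of pq.ParquetDataset so only the matching partitions are read:

import pyarrow.parquet as pq

# /mnt/blob is an assumption - use whatever path you mounted the container at
dataset = pq.ParquetDataset('/mnt/blob/dataset_name',
                            filters=[('year', '=', 2007), ('month', '=', 1)])

# only the files under the partitions matching the filters are read
table = dataset.read()
df = table.to_pandas()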