I have been manually partitioning files with pandas (creating an index or multi-index and then writing a separate parquet file for each index value in a loop) and uploading them to Azure Blob.
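Roughly, my current loop looks like this (a minimal sketch; the column names, paths, and toy data are illustrative):

import os
import pandas as pd

# Toy frame standing in for my real data
df = pd.DataFrame({
    'year': [2007, 2007, 2008],
    'month': [1, 2, 1],
    'value': [1.0, 2.0, 3.0],
})

for (year, month), group in df.groupby(['year', 'month']):
    path = f'dataset_name/year={year}/month={month:02d}'
    os.makedirs(path, exist_ok=True)
    group.to_parquet(f'{path}/0.parq')  # each file is then uploaded to Azure Blob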
However, when reading the docs for pyarrow, I see that it is possible to create a 'dataset' which includes a folder structure for partitioned data. https://arrow.apache.org/docs/python/parquet.html
The monthly/daily folder example there is exactly what I am trying to achieve.
dataset_name/
    year=2007/
        month=01/
            0.parq
            1.parq
            ...
        month=02/
            0.parq
            1.parq
            ...
        month=03/
            ...
    year=2008/
        month=01/
            ...
import pyarrow as pa
import pyarrow.parquet as pq

# The HDFS example from the docs:
fs = pa.hdfs.connect(host, port, user=user, kerb_ticket=ticket_cache_path)
pq.write_to_dataset(table, root_path='dataset_name',
                    partition_cols=['one', 'two'], filesystem=fs)
Can I do this with Azure Blob (or with Minio, which speaks the S3 API and sits in front of my Azure Blob storage)? My ultimate goal is to read only the files that are relevant to my 'query'.
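For what it's worth, this is the kind of thing I am hoping will work, a sketch assuming pyarrow accepts an fsspec-compatible filesystem such as adlfs (for Azure Blob) or s3fs (for Minio); the account, container, and credentials below are placeholders:

import pyarrow as pa
import pyarrow.parquet as pq
from adlfs import AzureBlobFileSystem

# Placeholder account name, key, and container
fs = AzureBlobFileSystem(account_name='myaccount', account_key='...')

table = pa.table({'year': [2007, 2008], 'month': [1, 1], 'value': [1.0, 2.0]})
pq.write_to_dataset(table, root_path='mycontainer/dataset_name',
                    partition_cols=['year', 'month'], filesystem=fs)

# The end goal: read back only the partitions relevant to a query
result = pq.read_table('mycontainer/dataset_name', filesystem=fs,
                       filters=[('year', '=', 2007), ('month', '=', 1)])

# For Minio, an s3fs filesystem pointed at its endpoint would presumably be the analogue:
# import s3fs
# fs = s3fs.S3FileSystem(key='...', secret='...',
#                        client_kwargs={'endpoint_url': 'http://localhost:9000'})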