I'm trying to read a partitioned parquet directory stored in an S3 bucket. For the sake of this question, let's call the bucket bucket. The bucket has one folder with nested partitions based on year/month/day/hour, so the path to any individual .parquet file looks like:

s3://bucket/folder/year/month/day/hour
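So the tree looks roughly like this (2019 and 05 are real values that show up in the errors below; the day, hour, and file names are placeholders):

bucket
└── folder
    └── 2019
        └── 05
            └── <day>
                └── <hour>
                    └── <file>.parquet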
I tried to read it the way I would read any other parquet file. I've been working with parquet files recently, but I hadn't tried reading a partitioned one before. My sample code is below:
import s3fs
import pyarrow.parquet as pq

# Create an S3 filesystem object (only required when reading from S3)
s3 = s3fs.S3FileSystem()

s3_path = "s3://bucket"
directory = 'folder'

# Load the partitioned dataset from S3 into a pandas DataFrame
data = pq.ParquetDataset(f'{s3_path}/{directory}', filesystem=s3).read_pandas().to_pandas()
This flow works for the general parquet files I've been reading, but on the partitioned directory I get this error:
ValueError: Directory name did not appear to be a partition: 2019
I've already tried diving into 2019, since I figured that the first level only had 2019 as a folder, so pyarrow might be treating it as a plain subdirectory rather than a partition. The path then looked like s3://bucket/folder/2019.
However, that just gave me the same error one level down:
ValueError: Directory name did not appear to be a partition: 05
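From what I can tell from the pyarrow docs, the default partition discovery expects hive-style directory names like year=2019/month=05, which my layout doesn't have. As a sketch of what I understand the newer pyarrow.dataset API would look like with the field names supplied explicitly (the names year/month/day/hour are my assumption, and I haven't verified this against my bucket):

import pyarrow.dataset as ds

# Passing a list of field names tells pyarrow to treat each directory
# level as a partition value (directory partitioning) instead of
# expecting key=value (hive-style) names.
dataset = ds.dataset(
    "s3://bucket/folder",
    format="parquet",
    partitioning=["year", "month", "day", "hour"],
)
data = dataset.to_table().to_pandas()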
I've also tried fastparquet, following an approach from this question: How to read partitioned parquet files from S3 using pyarrow in python. That didn't work either; if I try printing the list of files using all_paths_from_s3 from the answer to that question, I get a blank list ([]).
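For what it's worth, this is the kind of direct s3fs check I'd expect to list the files, assuming the layout described above (the glob depth is my guess):

import s3fs

s3 = s3fs.S3FileSystem()

# Confirm the top-level folder actually contains the year directories.
print(s3.ls('bucket/folder'))

# Glob down the four partition levels to the .parquet files themselves.
print(s3.glob('bucket/folder/*/*/*/*/*.parquet'))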