
I'm trying to read a partitioned parquet directory stored in an s3 bucket.

For the sake of this question, let's call the bucket bucket. The bucket contains one folder, which is partitioned by year/month/day/hour.

So the path to any given .parquet file looks like:

s3://bucket/folder/year/month/day/hour

I tried to read it the same way I would read any other parquet file, since I've been working with them recently. However, I hadn't tried reading a partitioned dataset before.

I've included my sample code below:

import s3fs
import pandas as pd
import pyarrow.parquet as pq

# Creating an S3 Filesystem (Only required when using S3)

s3 = s3fs.S3FileSystem()
s3_path = "s3://bucket"
directory = 'folder'

# Loading Files (S3)

data = pq.ParquetDataset(f'{s3_path}/{directory}', filesystem=s3).read_pandas().to_pandas()

This is the flow I've used before, and I know it works for regular (non-partitioned) parquet files. The error I get is:

ValueError: Directory name did not appear to be a partition: 2019

I also tried pointing directly into 2019, since the first level only had 2019 as a folder, so I figured pyarrow might be treating it as a plain subdirectory rather than a partition.

The path then looked like s3://bucket/folder/2019
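That is, the call became:

data = pq.ParquetDataset(f'{s3_path}/{directory}/2019', filesystem=s3).read_pandas().to_pandas()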

However, that gave me a similar error, just one level down:

ValueError: Directory name did not appear to be a partition: 05

I've also tried using fastparquet following an approach from this question: How to read partitioned parquet files from S3 using pyarrow in python

That didn't work either. When I printed the list of files using the all_paths_from_s3 function from the answer linked above, I got an empty list [].
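For reference, here's roughly how I'm sanity-checking the layout by listing the keys directly with s3fs (the glob pattern assumes the year/month/day/hour nesting described above):

import s3fs

s3 = s3fs.S3FileSystem()

# List every parquet file under the four partition levels
# to confirm what's actually in the bucket
paths = s3.glob('bucket/folder/*/*/*/*/*.parquet')
print(paths)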


1 Answer


This happens because ParquetDataset expects Hive-style partitioning, where each directory in the path is named key=value:

s3://bucket/folder/year=2019/month=05/day=01
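If renaming the existing directories isn't an option, newer versions of pyarrow (the pyarrow.dataset API) let you declare the partition fields explicitly instead of relying on key=value directory names. A minimal sketch, assuming the year/month/day/hour layout from your question (field names and types are my guesses):

import pyarrow as pa
import pyarrow.dataset as ds

# Without a flavor argument, ds.partitioning() matches plain directory
# names like .../2019/05/01/00/, assigning them to the fields in order
partitioning = ds.partitioning(
    pa.schema([
        ('year', pa.int16()),
        ('month', pa.int8()),
        ('day', pa.int8()),
        ('hour', pa.int8()),
    ])
)

dataset = ds.dataset('s3://bucket/folder', partitioning=partitioning)
df = dataset.to_table().to_pandas()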

If you are using Kinesis Data Firehose to persist the data to S3, for example, you can use the custom prefix option to override the default year/month/day/hour format.
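As a sketch of the Firehose side, the prefix can be updated on an existing delivery stream with boto3; the stream name, version id, and destination id below are placeholders, and the !{timestamp:...} expressions come from Firehose's custom-prefix syntax:

import boto3

firehose = boto3.client('firehose')

# Hypothetical example: switch the delivery stream to a Hive-style
# (key=value) prefix so ParquetDataset can parse the partitions
firehose.update_destination(
    DeliveryStreamName='my-stream',
    CurrentDeliveryStreamVersionId='1',
    DestinationId='destinationId-000000000001',
    ExtendedS3DestinationUpdate={
        'Prefix': 'folder/year=!{timestamp:yyyy}/month=!{timestamp:MM}/'
                  'day=!{timestamp:dd}/hour=!{timestamp:HH}/',
        # ErrorOutputPrefix is required once the prefix uses expressions
        'ErrorOutputPrefix': 'folder-errors/!{firehose:error-output-type}/',
    },
)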