
I created a Parquet dataset partitioned as follows:

2019-taxi-trips/
    - month=1/
        - data.parquet
    - month=2/
        - data.parquet
    ...
    - month=12/
        - data.parquet

This organization follows the Parquet dataset partitioning convention used by Hive Metastore. This partitioning scheme was generated by hand, so there is no _metadata file anywhere in the directory tree.
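
For reference, here is a minimal sketch of how such a layout could be produced by hand with pandas and fastparquet (the DataFrame contents and column names below are placeholders, not the actual taxi data):

import pathlib
import pandas as pd

# Placeholder data standing in for the 2019 taxi trips.
df = pd.DataFrame({
    "month": [1, 1, 2],
    "fare": [10.0, 7.5, 12.25],
})

root = pathlib.Path("2019-taxi-trips")
for month, chunk in df.groupby("month"):
    # One Hive-style "month=<value>" directory per partition,
    # each containing a single data.parquet file.
    out_dir = root / f"month={month}"
    out_dir.mkdir(parents=True, exist_ok=True)
    # Per the Hive convention, the partition column is encoded in the
    # directory name, so it is dropped from the file itself.
    chunk.drop(columns="month").to_parquet(
        out_dir / "data.parquet", engine="fastparquet"
    )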

I would like to now read this dataset into Dask.

With data located on local disk, the following code works:

import dask.dataframe as dd
dd.read_parquet(
    "/Users/alekseybilogur/Desktop/2019-taxi-trips/*/data.parquet",
    engine="fastparquet"
)

I copied these files to an S3 bucket (via s3 sync; the partition folders are top-level keys in the bucket) and attempted to read them from cloud storage using the same basic function:

import dask.dataframe as dd

dd.read_parquet(
    "s3://2019-nyc-taxi-trips/*/data.parquet",
    storage_options={
        "key": "...",
        "secret": "..."
    },
    engine="fastparquet"
)

This raises IndexError: list index out of range. Full stack trace here.

Is it not currently possible to read such a dataset directly from AWS S3?

This sounds like a bug; you should post it on the dask tracker. – mdurant

1 Answer


There is currently a bug in fastparquet that is preventing this code from working. See Dask GH#6713 for details.

Until this bug is resolved, an easy workaround is to use the pyarrow engine instead:

dd.read_parquet(
    "s3://2019-nyc-taxi-trips/*/data.parquet",
    storage_options={
        "key": "...",
        "secret": "..."
    },
    engine="pyarrow"
)
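
As a side note, if I recall correctly, pointing read_parquet at the dataset root rather than a per-file glob should let pyarrow discover the Hive-style month=<value> directories and expose month as a column in the resulting dataframe; something like:

import dask.dataframe as dd

# Assumption: reading from the dataset root lets the engine discover the
# Hive-style month=<value> partitions and add "month" as a column.
df = dd.read_parquet(
    "s3://2019-nyc-taxi-trips/",
    storage_options={
        "key": "...",
        "secret": "..."
    },
    engine="pyarrow"
)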