How to read a defined list of parquet files from s3 using PyArrow?

Question

I need to incrementally load data to Pandas from Parquet files stored in s3, i'm trying to use PyArrow for this but not having any luck.

Writing an entire directory of Parquet files into Pandas works just fine:

import s3fs
import pyarrow.parquet as pq
import pandas as pd

fs = s3fs.S3FileSystem(mykey,mysecret)
p_dataset = pq.ParquetDataset('s3://mys3bucket/directory', filesystem=fs)

df = p_dataset.read().to_pandas()

But when I try to load a single Parquet file I get an error:

fs = s3fs.S3FileSystem(mykey,mysecret)
p_dataset = pq.ParquetDataset('s3://mys3bucket/directory/1_0_00000000000000014012'
, filesystem=fs)

df = p_dataset.read().to_pandas()

Throws error:

    ---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-179-3d01b32c60f7> in <module>()
     15 p_dataset = pq.ParquetDataset(
     16     's3://mys3bucket/directory/1_0_00000000000000014012',
---> 17                       filesystem=fs)
     18 
     19 table2.to_pandas()

C:\User\Anaconda3\lib\site-packages\pyarrow\parquet.py in __init__(self, path_or_paths, filesystem, schema, metadata, split_row_groups, validate_schema, filters, metadata_nthreads)
    880 
    881         if validate_schema:
--> 882             self.validate_schemas()
    883 
    884         if filters is not None:

C:\User\Anaconda3\lib\site-packages\pyarrow\parquet.py in validate_schemas(self)
    893                 self.schema = self.common_metadata.schema
    894             else:
--> 895                 self.schema = self.pieces[0].get_metadata(open_file).schema
    896         elif self.schema is None:
    897             self.schema = self.metadata.schema

IndexError: list index out of range

Would appreciate any help with this error.

Ideally I need to append all new data added to s3 (added since the previous time I ran this script) to the Pandas dataframe so was thinking I pass a list of filenames to ParquetDataset. Is there a better way to achieve this? Thanks

Wes McKinney Wes McKinney · Accepted Answer · 2018-11-04T08:50:57

You want to use pq.read_table (pass a file path or file handle) instead of pq.ParquetDataset (pass a directory). HTH

How to read a defined list of parquet files from s3 using PyArrow?

2 Answers