python - How to read a Parquet file into Pandas DataFrame?

Question

How to read a modestly sized Parquet data-set into an in-memory Pandas DataFrame without setting up a cluster computing infrastructure such as Hadoop or Spark? This is only a moderate amount of data that I would like to read in-memory with a simple Python script on a laptop. The data does not reside on HDFS. It is either on the local file system or possibly in S3. I do not want to spin up and configure other services like Hadoop, Hive or Spark.

I thought Blaze/Odo would have made this possible: the Odo documentation mentions Parquet, but the examples seem all to be going through an external Hive runtime.

Do you happen to have the data openly available? My branch of python-parquet github.com/martindurant/parquet-python/tree/py3 had a pandas reader in parquet.rparquet, you could try it. There are many parquet constructs it cannot handle. — mdurant
Wait for the Apache Arrow project that the Pandas author Wes Mckinney is part of. wesmckinney.com/blog/pandas-and-apache-arrow After it is done, users should be able to read in Parquet file directly from Pandas. — XValidated
Since the question is closed as off-topic (but still the first result on Google) I have to answer in a comment.. You can now use pyarrow to read a parquet file and convert it to a pandas DataFrame: import pyarrow.parquet as pq; df = pq.read_table('dataset.parq').to_pandas() — sroecker
Kinda annoyed that this question was closed. Spark and parquet are (still) relatively poorly documented. Am also looking for the answer to this. — user48956
Both the fastparquet and pyarrow libraries make it possible to read a parquet file into a pandas dataframe: github.com/dask/fastparquet and arrow.apache.org/docs/python/parquet.html — ogrisel

chrisaycock chrisaycock · Accepted Answer · 2017-10-31T13:12:54

pandas 0.21 introduces new functions for Parquet:

pd.read_parquet('example_pa.parquet', engine='pyarrow')

or

pd.read_parquet('example_fp.parquet', engine='fastparquet')

The above link explains:

These engines are very similar and should read/write nearly identical parquet format files. These libraries differ by having different underlying dependencies (fastparquet by using numba, while pyarrow uses a c-library).

python - How to read a Parquet file into Pandas DataFrame?

6 Answers