
I am using the Python 3.6 interpreter in my PyCharm venv, and I am trying to convert a CSV file to Parquet.

import pandas as pd    
df = pd.read_csv('/parquet/drivers.csv')
df.to_parquet('output.parquet')

Error 1: ImportError: Unable to find a usable engine; tried using: 'pyarrow', 'fastparquet'. pyarrow or fastparquet is required for parquet support

Solution 1: installed fastparquet 0.2.1

Error 2:

File "/Users/python parquet/venv/lib/python3.6/site-packages/fastparquet/compression.py", line 131, in compress_data
    (algorithm, sorted(compressions)))
RuntimeError: Compression 'snappy' not available. Options: ['GZIP', 'UNCOMPRESSED']

I installed python-snappy 0.5.3 but I am still getting the same error. Do I need to install any other library?

If I use the PyArrow 0.12.0 engine, I don't experience the issue.

1 Answer


In fastparquet, snappy compression is an optional feature.

To quickly check a conversion from CSV to Parquet, you can run the following script (it only requires pandas and fastparquet):

import pandas as pd
from fastparquet import ParquetFile

df = pd.DataFrame({"col1": [1, 2, 3, 4], "col2": ["a", "b", "c", "d"]})
# df.head()  # Inspect the initial value
df.to_csv("/tmp/test_csv", index=False)

df_csv = pd.read_csv("/tmp/test_csv")
df_csv.head()  # Inspect the intermediate value
df_csv.to_parquet("/tmp/test_parquet", engine="fastparquet", compression="GZIP")

df_parquet = ParquetFile("/tmp/test_parquet").to_pandas()
df_parquet.head()  # Inspect the final value

However, if you need to read or write with snappy compression, you might follow this answer about installing the snappy library on Ubuntu.
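As a rough sketch of the usual route on Ubuntu (package names assumed from a typical setup; see the linked answer for details): install the system snappy headers first, then the Python bindings, so fastparquet picks up the codec on its next import:

```shell
# Assumed typical Ubuntu setup; exact package names may vary by release
sudo apt-get install libsnappy-dev
pip install python-snappy
```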