Hi, I need a Lambda function that reads and writes Parquet files and saves them to S3. I tried to build a deployment package with the libraries I need (pyarrow), but I am getting an initialization error for the cffi library:
module initialization error: [Errno 2] No such file or directory: '/var/task/__pycache__/_cffi__x762f05ffx6bf5342b.c'
Can I even create Parquet files with AWS Lambda? Has anyone had a similar problem?
I would like to do something like this:
import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd
df = pd.DataFrame([data])  # data is a dictionary
table = pa.Table.from_pandas(df)
# /tmp is the only writable directory in the Lambda environment
pq.write_table(table, '/tmp/test.parquet', compression='snappy')
table = pq.read_table('/tmp/test.parquet')
df = table.to_pandas()
print(df)
Or by some other method; I just need to be able to read and write Parquet files compressed with Snappy.