3
votes

I have been testing parquet files on Azure Blob as an alternative to loading data into PostgreSQL tables, since I do a lot of my extract/transform steps with pandas and may explore Spark soon.

  1. Are there any pros or cons to using pyarrow to open csv files instead of pd.read_csv?
  2. Should I use pyarrow to write parquet files instead of pd.to_parquet?

Ultimately, I am storing raw files (csv, json, and xlsx). I read these with pandas or pyarrow, add some metadata columns, and then save a refined/transformed parquet file (Spark flavor, snappy compression). I then read these transformed files with pyarrow (maybe Spark eventually) and perform some aggregations or other stuff for visualization (which I might save as yet another parquet file).
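For reference, the read/transform/write step looks roughly like this (the paths and metadata columns are made up for illustration; the real files live on Azure Blob):

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from datetime import datetime, timezone

# Illustrative paths only.
raw_path = "raw/sales.csv"
refined_path = "refined/sales.parquet"

# Read the raw file and add metadata columns.
df = pd.read_csv(raw_path)
df["source_file"] = raw_path
df["ingested_at"] = datetime.now(timezone.utc).isoformat()

# Convert to Arrow and write a Spark-flavored, snappy-compressed parquet file.
table = pa.Table.from_pandas(df)
pq.write_table(table, refined_path, compression="snappy", flavor="spark")
```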

Am I using more memory by reading data with pyarrow before converting it to a pandas dataframe, or by converting data with pyarrow before saving it as a .parquet file? I just want to make sure I am being efficient since I am setting up an Airflow operator which will be following these steps for a lot of files/data.


1 Answer

3
votes

Are there any pros or cons to using pyarrow to open csv files instead of pd.read_csv?

There is no workaround that lets a pandas DataFrame compute in distributed mode on Spark. Mixing pandas DataFrames with Spark is not practical, especially with large datasets: the pandas work runs on a single host, so you are basically using the power of the Spark host, not Spark itself.
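If you do move to Spark, read the file straight into a Spark DataFrame so the work stays on the executors instead of the driver. A minimal sketch, assuming a hypothetical local path:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("etl").getOrCreate()

# Read the CSV directly into a distributed Spark DataFrame,
# rather than into pandas on the driver host.
sdf = spark.read.csv("raw/sales.csv", header=True, inferSchema=True)
sdf.printSchema()
```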

Should I use pyarrow to write parquet files instead of pd.to_parquet?

It will not be easy to refactor everything in one shot and replace all the pandas dependencies, but ideally you would use pyarrow at every data fetch/set (read/write) step.
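For the parquet writing itself, pyarrow can go from csv to parquet without pandas in the middle. A minimal sketch with hypothetical paths:

```python
import pyarrow.csv as pacsv
import pyarrow.parquet as pq

# Read the CSV directly into an Arrow table, bypassing pandas entirely.
table = pacsv.read_csv("raw/sales.csv")

# Write it out as a Spark-flavored, snappy-compressed parquet file.
pq.write_table(table, "refined/sales.parquet", compression="snappy", flavor="spark")
```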

Am I using more memory by reading data with pyarrow before converting it to a pandas dataframe, or by converting data with pyarrow before saving it as a .parquet file?

Yes, going from pyarrow to pandas uses more memory: to_pandas() generally copies the data into pandas' own memory layout, so both representations are resident at once during the conversion. If the end product is a parquet file anyway, write it directly from the Arrow table and skip the pandas round trip.
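You can measure the difference yourself; this sketch assumes a hypothetical path:

```python
import pyarrow.csv as pacsv

table = pacsv.read_csv("raw/sales.csv")
print("arrow table bytes: ", table.nbytes)

# to_pandas() materializes a second, pandas-native copy of the data,
# so peak memory during this call covers both representations.
df = table.to_pandas()
print("pandas frame bytes:", df.memory_usage(deep=True).sum())
```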

In summary, assuming you are not reading the pandas DataFrame from a memory dump, try loading the data into Spark DataFrames, and work with Spark's built-in DataFrames and pyarrow parquet files throughout the entire process.