I have been testing storing data as parquet files on Azure Blob Storage instead of loading it into PostgreSQL tables, since I do a lot of my extract/transform steps with pandas and may explore Spark soon.
- Are there any pros or cons to using pyarrow to read CSV files instead of pd.read_csv?
- Should I use pyarrow to write parquet files instead of pandas' DataFrame.to_parquet? (Both options are sketched after this list.)
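For concreteness, this is roughly what the paths I'm comparing look like; a minimal sketch, and the file names are just placeholders:

```python
import pandas as pd
import pyarrow.csv as pv
import pyarrow.parquet as pq

# Path A: pandas end to end (to_parquet uses the pyarrow engine here)
df = pd.read_csv("sample.csv")                       # placeholder file name
df.to_parquet("out_pandas.parquet", engine="pyarrow", compression="snappy")

# Path B: pyarrow end to end (multithreaded CSV reader, no DataFrame created)
table = pv.read_csv("sample.csv")                    # returns a pyarrow.Table
pq.write_table(table, "out_arrow.parquet", compression="snappy")

# Path C: read with pyarrow, convert to pandas only when I need pandas transforms
df_from_arrow = pv.read_csv("sample.csv").to_pandas()
```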
Ultimately, I am storing raw files (CSV, JSON, and XLSX). I read these with pandas or pyarrow, add some metadata columns, and then save a refined/transformed parquet file (Spark flavor, snappy compression). I then read these transformed files with pyarrow (maybe Spark eventually) and perform some aggregations or other processing for visualization (which I might save as yet another parquet file).
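I'm planning something along these lines; the paths, metadata fields, and column names (region, amount) are just placeholders, and the group_by aggregation assumes a reasonably recent pyarrow (7.0+):

```python
from datetime import datetime, timezone

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# 1. Read a raw file (CSV here; JSON/XLSX would go through pd.read_json / pd.read_excel)
df = pd.read_csv("raw/sales.csv")                    # placeholder path

# 2. Add metadata columns
df["source_file"] = "raw/sales.csv"
df["ingested_at"] = datetime.now(timezone.utc).isoformat()

# 3. Save the refined parquet file (Spark flavor, snappy compression)
table = pa.Table.from_pandas(df, preserve_index=False)
pq.write_table(table, "refined/sales.parquet", flavor="spark", compression="snappy")

# 4. Read the refined file back with pyarrow and aggregate without going through pandas
refined = pq.read_table("refined/sales.parquet")
summary = refined.group_by("region").aggregate([("amount", "sum")])  # placeholder columns
pq.write_table(summary, "curated/sales_by_region.parquet", compression="snappy")
```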
Am I using more memory by reading data with pyarrow before converting it to a pandas DataFrame, or by converting data to a pyarrow Table before saving it as a parquet file? I just want to make sure I am being efficient, since I am setting up an Airflow operator that will follow these steps for a lot of files/data.
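A rough way I've thought about comparing the footprint of the two read paths (again, the file name is a placeholder) would be something like:

```python
import pandas as pd
import pyarrow.csv as pv

# Read path 1: straight into pandas
df_direct = pd.read_csv("sample.csv")                # placeholder file name

# Read path 2: pyarrow first, then convert; the Arrow table and the DataFrame
# coexist briefly, so peak memory during the conversion can be higher
table = pv.read_csv("sample.csv")
df_from_arrow = table.to_pandas()

print("arrow table bytes: ", table.nbytes)
print("pandas frame bytes:", df_from_arrow.memory_usage(deep=True).sum())
```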