1
votes

I have an ETL job loading a lot of CSV files. Some of those CSV files contain the same type of data, e.g. 60 files with data for one initial dataframe, another 30 files with data for another initial dataframe, and so on. Those dataframes then get joined and aggregated using the DataFrame API, and ultimately the final dataframe gets saved to a single Parquet file.

Is it beneficial for me to first convert each group of CSV files to a single Parquet file before reading those Parquet files and processing them further? Would it make things faster (considering that this conversion step would run every time in my job)? Would it help Spark use less memory, since my dataframes would now be backed by Parquet files rather than CSV files?


1 Answer

2
votes

I think this would only be beneficial if your data were immutable. If the CSV files keep changing, then you'll just be reading, converting, and writing them again before every run.

Once Spark has read the CSV files, the data is already loaded into memory.

If I recall correctly, the main efficiency Parquet files provide is the ability to read only the necessary data from disk when you filter the dataset immediately after reading it. I believe the filter is pushed down to the disk-read layer, which can then skip row groups that contain no matching records (using the min/max statistics stored per row group).

In your case, you have to read every CSV in full to write the Parquet file anyway, so you'd just be wasting resources on the extra write and re-read steps.