I have an ETL job that loads a large number of CSV files. The files fall into groups that share a schema, e.g. 60 files feed one initial dataframe, another 30 files feed a second dataframe, and so on. Those dataframes are then joined and aggregated with the DataFrame API, and the final dataframe is saved as a single Parquet file.
Would it be beneficial to first convert each group of CSV files into a single Parquet file and then read those Parquet files for the rest of the processing? Would that make the job faster, given that the conversion step would run on every execution? And would Spark use less memory, since the dataframes would now be backed by Parquet files rather than CSV files?