I have Parquet files for 2 billion records with GZIP compression, and the same data with Snappy compression. I also have delimited files for the same 2 billion records. We have 72 Vertica nodes in AWS prod, and we are seeing a huge performance gap for the Parquet files when moving data from S3 to Vertica with the COPY command compared to the delimited files: Parquet takes 7x more time than the delimited files, even though the delimited files are roughly 50x larger than the Parquet files (450 GB vs. 9.2 GB for Snappy).
Below are the stats for the test we conducted.
Total file sizes:
Parquet GZIP - 6 GB
Parquet Snappy - 9.2 GB
Delimited - 450 GB
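As a sanity check that both paths load the same data, something along these lines confirms both tables ended up with the same 2 billion rows (just a sketch, using the table names from the COPY commands below):

SELECT COUNT(*) FROM schema.table1; -- loaded from the delimited files
SELECT COUNT(*) FROM schema.table2; -- loaded from the Parquet files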
Below are the COPY commands used for the delimited and Parquet files. We did see an improvement of about 2 minutes when we removed NO COMMIT from the COPY query.
Delimited files
COPY schema.table1 ( col1,col2,col3,col4,col5 ) FROM 's3://path_to_delimited_s3/*' DELIMITER E'\001' NULL AS '\N' NO ESCAPE ABORT ON ERROR DIRECT NO COMMIT;
Parquet files
COPY schema.table2 (col1,col2,col3,col4,col5 ) FROM 's3://path_to_parquet_s3/*' PARQUET ABORT ON ERROR DIRECT NO COMMIT;
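To compare the two loads while they run, a query along these lines against Vertica's LOAD_STREAMS system table shows progress side by side (a sketch only; the column names are from the Vertica documentation, and table1/table2 match the COPY commands above):

SELECT table_name, stream_name, load_start, is_executing,
       accepted_row_count, rejected_row_count,
       read_bytes, parse_complete_percent
FROM v_monitor.load_streams
WHERE table_name IN ('table1', 'table2')
ORDER BY load_start DESC;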
We are surprised to see this gap for the Parquet files. Is this expected for Parquet COPY? Any pointers or thoughts would be really helpful.
Thanks
