I have 66 Parquet files with GZIP compression, containing about 2 billion records in total, in S3. I am using a Vertica COPY command to load the data from S3 into Vertica, as below:
COPY schema.table (col1, col2, col3, col4, col5, col6)
FROM 's3://test_path/*' PARQUET
ABORT ON ERROR DIRECT NO COMMIT
We have 4 Vertica nodes. Loading these 2 billion rows takes 45+ minutes. The Vertica documentation says that loads from S3 run multithreaded across multiple nodes by default. Our DBA told me that the way to get the best performance is to run 66 queries in parallel (one query per file), so that each query runs on a different node and each query loads a different file.
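For context, this is roughly how I have been checking whether the single COPY already fans out across nodes while it runs (a minimal sketch; load_streams is a standard v_monitor table, but I am assuming load_sources is also available in our Vertica version):

-- Overall progress of the running COPY:
SELECT table_name, accepted_row_count, read_bytes, parse_complete_percent
FROM v_monitor.load_streams
WHERE is_executing;

-- How many source files each node is currently parsing
-- (assumes v_monitor.load_sources exists in this version):
SELECT node_name, COUNT(*) AS open_sources
FROM v_monitor.load_sources
GROUP BY node_name;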
The Vertica COPY command is called programmatically from Java. I don't want to run one COPY command per file: that makes transactions a hassle to maintain, and the file count might grow to 1000+ during peak load.
I want to bring down the execution time using only one COPY command; any pointers or help would be really appreciated. A sketch of the kind of tuning I have been considering is below.
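One idea I have looked at is running the load under a dedicated resource pool with higher parallelism (a sketch only, assuming this is a sensible direction; the pool name 'bulk_load_pool' and the parameter values are placeholders, not recommendations):

-- Dedicated pool for the bulk load (name and values are illustrative):
CREATE RESOURCE POOL bulk_load_pool
    MEMORYSIZE '8G'
    PLANNEDCONCURRENCY 4
    EXECUTIONPARALLELISM AUTO;

-- Run the load under that pool in the same session:
SET SESSION RESOURCE_POOL = bulk_load_pool;

COPY schema.table (col1, col2, col3, col4, col5, col6)
FROM 's3://test_path/*' PARQUET
ABORT ON ERROR DIRECT NO COMMIT;

I have not verified whether this actually helps for Parquet loads from S3, so I would welcome advice on whether resource pool tuning is the right lever here or whether something else limits a single COPY.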
Thanks