Vertica performance issue while copying parquet files from S3

Question

I have 66 parquet files wit GZIP compression consisting of 2 billion records in S3, I am using a Vertica copy command to copy data from s3 to Vertica as below.

COPY schema.table ( col1, col2, col3, col4, col5, col6 ) FROM 's3://test_path/*' PARQUET ABORT ON ERROR DIRECT NO COMMIT

We have 4 Vertica nodes. In order to copy these 2 billion rows, it takes 45+ mins. Vertica documentation says that loading files from S3 run in multithread on multiple nodes by default. I was told by our DBA that the way to reach the best performance is running 66 queries (1 query per file) in parallel, this way each query would be running on a different node and each query would be loading a different file.

Vertica Copy command is called programmatically from Java, I don't want to run copy command per files, this becomes a hassle to maintain transaction and also, files might increase to 1000+ during peak load.

I want to bring down the execution time using only one COPY command, any pointers and help will be really appreciated.

Thanks

marcothesane marcothesane · Accepted Answer · 2019-10-24T22:32:02

Try this one:

COPY schema.table ( col1, col2, col3, col4, col5, col6 ) 
FROM 's3://test_path/*' ON ANY NODE PARQUET
ABORT ON ERROR DIRECT;

I don't have any idea how your DBA got the idea you should run one command per file ... The command above will involve each of the 4 nodes in the parsing phase, usually even with several parallel parsers per node. And once the parsing phase is completed, the data will be segmented according to the table's projections' segmentation scheme across the nodes, each node will sort and encode its own data and finally write it to disk - and will commit at the end. Just remember the ON ANY NODE directive after the file glob string.

With a few 1000 instead of your 60 files you might eventually get to a point where the performance deteriorates (with a few terabyte's worth of data, for example, depending on the RAM size of your nodes) - and then you might want to split into several commands using the usual divide-and-conquer approach. Just try it for starters ...

Vertica performance issue while copying parquet files from S3

1 Answers