2
votes

I have a Java application that reads from a group of tables, processes the data, and then inserts the processed data into one table. The data volume is very large, so multithreading is unavoidable. The problem occurs when multiple threads attempt to insert at the same time, which causes a write lock and an insert lock on the table. After that, all the threads wait for the table to become available until the connection times out.

  • The records being inserted do not have any foreign keys pointing to any table

  • The records being inserted are completely independent from each other

  • The table does not have an incremental ID of any sort

  • No other application is using any of the tables involved in this application

I am using a Java PreparedStatement to do the inserts. Each thread calls an insert method that opens its own connection.

Is there anything I'm doing wrong, or is this a compatibility issue with Redshift?

1
Are you using a connection pool? Could the contention be in that rather than in Redshift? I'd expect a data warehouse product to be able to deal with the scenario you outline above (although I know absolutely nothing about Redshift specifically). - DaveH
According to the documentation: COPY and INSERT operations are pure write ... COPY and INSERT operations against the same table are held in a wait state until the lock is released, then they proceed as normal. See docs.aws.amazon.com/es_es/redshift/latest/dg/… - Federico Sierra
But based on this, the threads should just go into a queue in order to do the insert. Yet none of the threads are inserting. - Med Zamrik
According to docs.aws.amazon.com/redshift/latest/dg/r_STV_LOCKS.html "Amazon Redshift locks tables to prevent two users from updating the same table at the same time." So this would cause a conflict. - jason.kaisersmith
@DaveH I am using a pool for retrieving the data. But the insert is not using a pool. I'm using a java.sql.Connection object. - Med Zamrik

1 Answer

4
votes

Doing large numbers of small inserts into a Redshift table over individual connections is definitely not what you want to do. The Redshift WLM query queues are not suited to many small concurrent operations. By default only 5 queries run concurrently, and you want to save those slots for your actual work queries, not for loading. This makes sense when you consider that Redshift is optimized for a small number of long-running queries over very large data sets.
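One way to sidestep the lock contention in the meantime, while keeping the multithreaded processing, is to funnel all rows through a single writer thread that batches them before touching the table. A minimal sketch (class, method, and sentinel names are illustrative, not from the question's code; `flushBatch` is where one multi-row INSERT on a single JDBC connection would go):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Sketch: many worker threads enqueue processed rows, and ONE writer thread
// drains them in batches, so only a single connection ever writes to the table.
// flushBatch() is a stand-in -- in the real app it would execute one multi-row
// INSERT (or stage to S3 and COPY) on its own JDBC connection.
public class SingleWriterBatcher {

    private static final String POISON = "__done__"; // sentinel ending the writer loop

    public static List<List<String>> run(int workers, int rowsPerWorker, int batchSize) {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        List<List<String>> flushed = new ArrayList<>(); // stands in for executed batches

        Thread writer = new Thread(() -> {
            List<String> batch = new ArrayList<>(batchSize);
            try {
                while (true) {
                    String row = queue.take();
                    if (POISON.equals(row)) break;
                    batch.add(row);
                    if (batch.size() >= batchSize) {
                        flushBatch(flushed, batch);
                        batch = new ArrayList<>(batchSize);
                    }
                }
                if (!batch.isEmpty()) flushBatch(flushed, batch); // trailing partial batch
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        writer.start();

        List<Thread> producers = new ArrayList<>();
        for (int w = 0; w < workers; w++) {
            final int id = w;
            Thread t = new Thread(() -> {
                for (int i = 0; i < rowsPerWorker; i++) {
                    queue.add("row-" + id + "-" + i); // a processed record
                }
            });
            t.start();
            producers.add(t);
        }
        try {
            for (Thread t : producers) t.join();
            queue.add(POISON); // all producers finished: let the writer drain and stop
            writer.join();
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
        return flushed;
    }

    private static void flushBatch(List<List<String>> sink, List<String> batch) {
        // Real code: one PreparedStatement with a multi-row VALUES list goes here.
        sink.add(batch);
    }
}
```

Because only the writer thread ever holds a write connection, the table lock is never contended, and each flush amortizes the per-statement overhead across the whole batch.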

For your case I recommend you take a look at using Kinesis Firehose to load your data into Redshift. It is designed for exactly this purpose, and you can write to it efficiently from many threads. It will batch the data for you, gives you an opportunity to filter/transform the data if needed, and can back the data up to S3 as a side effect if desired.

Here is a Java example.
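As a rough sketch of the thread-friendly side of this (assuming the AWS SDK for Java 1.x; the helper class and delivery-stream name are illustrative, and the SDK calls shown in the comment should be verified against the SDK version in use), Firehose's `PutRecordBatch` accepts at most 500 records per call, so rows need to be chunked before sending:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch for use with the AWS SDK for Java 1.x
// (com.amazonaws:aws-java-sdk-kinesisfirehose).
// PutRecordBatch accepts at most 500 records per call, so worker threads
// can hand rows to a helper that chunks them before sending.
public class FirehoseBatcher {

    static final int MAX_BATCH = 500; // PutRecordBatch per-call record limit

    // Split rows into batches no larger than MAX_BATCH.
    public static List<List<String>> chunk(List<String> rows) {
        List<List<String>> batches = new ArrayList<>();
        for (int i = 0; i < rows.size(); i += MAX_BATCH) {
            batches.add(new ArrayList<>(
                rows.subList(i, Math.min(i + MAX_BATCH, rows.size()))));
        }
        return batches;
    }

    /* In the real application each batch would then be sent roughly like this
       (names from the v1 SDK; check them against the version you use):

       AmazonKinesisFirehose firehose = AmazonKinesisFirehoseClientBuilder.defaultClient();
       List<Record> records = new ArrayList<>();
       for (String row : batch) {
           records.add(new Record().withData(
               ByteBuffer.wrap((row + "\n").getBytes(StandardCharsets.UTF_8))));
       }
       firehose.putRecordBatch(new PutRecordBatchRequest()
           .withDeliveryStreamName("my-stream")   // hypothetical stream name
           .withRecords(records));
    */
}
```

Firehose then buffers these records and issues the Redshift COPY itself, which is the loading path Redshift is optimized for.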