29
votes

I'm trying to find the faster way to do batch insert.

I tried to insert several batches with jdbcTemplate.update(String sql), where sql was builded by StringBuilder and looks like:

INSERT INTO TABLE(x, y, i) VALUES(1,2,3), (1,2,3), ... , (1,2,3)

Batch size was exactly 1000. I inserted nearly 100 batches. I checked the time using StopWatch and found out insert time:

min[38ms], avg[50ms], max[190ms] per batch

I was glad but I wanted to make my code better.

After that, I tried to use jdbcTemplate.batchUpdate in way like:

    jdbcTemplate.batchUpdate(sql, new BatchPreparedStatementSetter() {
        @Override
        public void setValues(PreparedStatement ps, int i) throws SQLException {
                       // ...
        }
        @Override
        public int getBatchSize() {
            return 1000;
        }
    });

where sql was look like

INSERT INTO TABLE(x, y, i) VALUES(1,2,3);

and I was disappointed! jdbcTemplate executed every single insert of 1000 lines batch in separated way. I loked at mysql_log and found there a thousand inserts. I checked the time using StopWatch and found out insert time:

min[900ms], avg[1100ms], max[2000ms] per Batch

So, can anybody explain to me, why jdbcTemplate doing separated inserts in this method? Why method's name is batchUpdate? Or may be I am using this method in wrong way?

9
I ran into a similar problem trying to use spring batch. I found that doing the inserts manually using a JDBC connection (PreparedStatement.addBatch() and calling executeBatch() and committing every thousand records) was an order of magnitude faster than using Spring. Never really figured out why though I suspected it had to do with how commits were being applied. I was using various batch sizes from 100 to 10000 in Spring.reblace
Might be related to the flush value of the underlying connection?Taylor
Are you using rewriteBatchedStatements=true in the JDBC connection string? The Spring documentation states that Will fall back to separate updates on a single PreparedStatement if the JDBC driver does not support batch updates.Elliot Vargas
"&rewriteBatchedStatements=true"; worked for me and approved performance.Jeryl Cook
This could be because auto-commit being true on the connection. If spring transactions are leveraged or datasource has auto-commit off, this behaviour should not happen.Krishna

9 Answers

22
votes

These parameters in the JDBC connection URL can make a big difference in the speed of batched statements --- in my experience, they speed things up:

?useServerPrepStmts=false&rewriteBatchedStatements=true

See: JDBC batch insert performance

16
votes

I have also faced the same issue with Spring JDBC template. Probably with Spring Batch the statement was executed and committed on every insert or on chunks, that slowed things down.

I have replaced the jdbcTemplate.batchUpdate() code with original JDBC batch insertion code and found the Major performance improvement.

DataSource ds = jdbcTemplate.getDataSource();
Connection connection = ds.getConnection();
connection.setAutoCommit(false);
String sql = "insert into employee (name, city, phone) values (?, ?, ?)";
PreparedStatement ps = connection.prepareStatement(sql);
final int batchSize = 1000;
int count = 0;

for (Employee employee: employees) {

    ps.setString(1, employee.getName());
    ps.setString(2, employee.getCity());
    ps.setString(3, employee.getPhone());
    ps.addBatch();

    ++count;

    if(count % batchSize == 0 || count == employees.size()) {
        ps.executeBatch();
        ps.clearBatch(); 
    }
}

connection.commit();
ps.close();

Check this link as well JDBC batch insert performance

15
votes

I found a major improvement setting the argTypes array in the call.

In my case, with Spring 4.1.4 and Oracle 12c, for insertion of 5000 rows with 35 fields:

jdbcTemplate.batchUpdate(insert, parameters); // Take 7 seconds

jdbcTemplate.batchUpdate(insert, parameters, argTypes); // Take 0.08 seconds!!!

The argTypes param is an int array where you set each field in this way:

int[] argTypes = new int[35];
argTypes[0] = Types.VARCHAR;
argTypes[1] = Types.VARCHAR;
argTypes[2] = Types.VARCHAR;
argTypes[3] = Types.DECIMAL;
argTypes[4] = Types.TIMESTAMP;
.....

I debugged org\springframework\jdbc\core\JdbcTemplate.java and found that most of the time was consumed trying to know the nature of each field, and this was made for each record.

Hope this helps !

9
votes

Simply use transaction. Add @Transactional on method.

Be sure to declare the correct TX manager if using several datasources @Transactional("dsTxManager"). I have a case where inserting 60000 records. It takes about 15s. No other tweak:

@Transactional("myDataSourceTxManager")
public void save(...) {
...
    jdbcTemplate.batchUpdate(query, new BatchPreparedStatementSetter() {

            @Override
            public void setValues(PreparedStatement ps, int i) throws SQLException {
                ...

            }

            @Override
            public int getBatchSize() {
                if(data == null){
                    return 0;
                }
                return data.size();
            }
        });
    }
7
votes

Change your sql insert to INSERT INTO TABLE(x, y, i) VALUES(1,2,3). The framework creates a loop for you. For example:

public void insertBatch(final List<Customer> customers){

  String sql = "INSERT INTO CUSTOMER " +
    "(CUST_ID, NAME, AGE) VALUES (?, ?, ?)";

  getJdbcTemplate().batchUpdate(sql, new BatchPreparedStatementSetter() {

    @Override
    public void setValues(PreparedStatement ps, int i) throws SQLException {
        Customer customer = customers.get(i);
        ps.setLong(1, customer.getCustId());
        ps.setString(2, customer.getName());
        ps.setInt(3, customer.getAge() );
    }

    @Override
    public int getBatchSize() {
        return customers.size();
    }
  });
}

IF you have something like this. Spring will do something like:

for(int i = 0; i < getBatchSize(); i++){
   execute the prepared statement with the parameters for the current iteration
}

The framework first creates PreparedStatement from the query (the sql variable) then the setValues method is called and the statement is executed. that is repeated as much times as you specify in the getBatchSize() method. So the right way to write the insert statement is with only one values clause. You can take a look at http://docs.spring.io/spring/docs/3.0.x/reference/jdbc.html

5
votes

I don't know if this will work for you, but here's a Spring-free way that I ended up using. It was significantly faster than the various Spring methods I tried. I even tried using the JDBC template batch update method the other answer describes, but even that was slower than I wanted. I'm not sure what the deal was and the Internets didn't have many answers either. I suspected it had to do with how commits were being handled.

This approach is just straight JDBC using the java.sql packages and PreparedStatement's batch interface. This was the fastest way that I could get 24M records into a MySQL DB.

I more or less just built up collections of "record" objects and then called the below code in a method that batch inserted all the records. The loop that built the collections was responsible for managing the batch size.

I was trying to insert 24M records into a MySQL DB and it was going ~200 records per second using Spring batch. When I switched to this method, it went up to ~2500 records per second. so my 24M record load went from a theoretical 1.5 days to about 2.5 hours.

First create a connection...

Connection conn = null;
try{
    Class.forName("com.mysql.jdbc.Driver");
    conn = DriverManager.getConnection(connectionUrl, username, password);
}catch(SQLException e){}catch(ClassNotFoundException e){}

Then create a prepared statement and load it with batches of values for insert, and then execute as a single batch insert...

PreparedStatement ps = null;
try{
    conn.setAutoCommit(false);
    ps = conn.prepareStatement(sql); // INSERT INTO TABLE(x, y, i) VALUES(1,2,3)
    for(MyRecord record : records){
        try{
            ps.setString(1, record.getX());
            ps.setString(2, record.getY());
            ps.setString(3, record.getI());

            ps.addBatch();
        } catch (Exception e){
            ps.clearParameters();
            logger.warn("Skipping record...", e);
        }
    }

    ps.executeBatch();
    conn.commit();
} catch (SQLException e){
} finally {
    if(null != ps){
        try {ps.close();} catch (SQLException e){}
    }
}

Obviously I've removed error handling and the query and Record object is notional and whatnot.

Edit: Since your original question was comparing the insert into foobar values (?,?,?), (?,?,?)...(?,?,?) method to Spring batch, here's a more direct response to that:

It looks like your original method is likely the fastest way to do bulk data loads into MySQL without using something like the "LOAD DATA INFILE" approach. A quote from the MysQL docs (http://dev.mysql.com/doc/refman/5.0/en/insert-speed.html):

If you are inserting many rows from the same client at the same time, use INSERT statements with multiple VALUES lists to insert several rows at a time. This is considerably faster (many times faster in some cases) than using separate single-row INSERT statements.

You could modify the Spring JDBC Template batchUpdate method to do an insert with multiple VALUES specified per 'setValues' call, but you'd have to manually keep track of the index values as you iterate over the set of things being inserted. And you'd run into a nasty edge case at the end when the total number of things being inserted isn't a multiple of the number of VALUES lists you have in your prepared statement.

If you use the approach I outline, you could do the same thing (use a prepared statement with multiple VALUES lists) and then when you get to that edge case at the end, it's a little easier to deal with because you can build and execute one last statement with exactly the right number of VALUES lists. It's a bit hacky, but most optimized things are.

5
votes

I had also some bad time with Spring JDBC batch template. In my case, it would be, like, insane to use pure JDBC, so instead I used NamedParameterJdbcTemplate. This was a must have in my project. But it was way slow to insert hundreds os thousands of lines in the database.

To see what was going on, I've sampled it with VisualVM during the batch update and, voilà:

visualvm showing where it was slow

What was slowing the process was that, while setting the parameters, Spring JDBC was querying the database to know the metadata each parameter. And seemed to me that it was querying the database for each parameter for each line every time. So I just taught Spring to ignore the parameter types (as it is warned in the Spring documentation about batch operating a list of objects):

    @Bean(name = "named-jdbc-tenant")
    public synchronized NamedParameterJdbcTemplate getNamedJdbcTemplate(@Autowired TenantRoutingDataSource tenantDataSource) {
        System.setProperty("spring.jdbc.getParameterType.ignore", "true");
        return new NamedParameterJdbcTemplate(tenantDataSource);
    }

Note: the system property must be set before creating the JDBC Template object. It would be possible to just set in the application.properties, but this solved and I've never after touched this again

1
votes

Solution given by @Rakesh worked for me. Significant improvement in performance. Earlier time was 8 min, with this solution taking less than 2 min.

DataSource ds = jdbcTemplate.getDataSource();
Connection connection = ds.getConnection();
connection.setAutoCommit(false);
String sql = "insert into employee (name, city, phone) values (?, ?, ?)";
PreparedStatement ps = connection.prepareStatement(sql);
final int batchSize = 1000;
int count = 0;

for (Employee employee: employees) {

    ps.setString(1, employee.getName());
    ps.setString(2, employee.getCity());
    ps.setString(3, employee.getPhone());
    ps.addBatch();

    ++count;

    if(count % batchSize == 0 || count == employees.size()) {
        ps.executeBatch();
        ps.clearBatch(); 
    }
}

connection.commit();
ps.close();
0
votes

Encountered some serious performance issue with JdbcBatchItemWriter.write() (link) from Spring Batch and find out the write logic delegates to JdbcTemplate.batchUpdate() eventually.

Adding a Java system properties of spring.jdbc.getParameterType.ignore=true fixed the performance issue entirely ( from 200 records per second to ~ 5000 ). The patch was tested working on both Postgresql and MsSql (might not be dialect specific)

... and ironically, Spring documented this behaviour under a "note" section link

In such a scenario, with automatic setting of values on an underlying PreparedStatement, the corresponding JDBC type for each value needs to be derived from the given Java type. While this usually works well, there is a potential for issues (for example, with Map-contained null values). Spring, by default, calls ParameterMetaData.getParameterType in such a case, which can be expensive with your JDBC driver. You should use a recent driver version and consider setting the spring.jdbc.getParameterType.ignore property to true (as a JVM system property or in a spring.properties file in the root of your classpath) if you encounter a performance issue — for example, as reported on Oracle 12c (SPR-16139).

Alternatively, you might consider specifying the corresponding JDBC types explicitly, either through a 'BatchPreparedStatementSetter' (as shown earlier), through an explicit type array given to a 'List<Object[]>' based call, through 'registerSqlType' calls on a custom 'MapSqlParameterSource' instance, or through a 'BeanPropertySqlParameterSource' that derives the SQL type from the Java-declared property type even for a null value.