I need to write ~50 million rows fetched from a JDBC ResultSet to a CSV file.
1.5 million rows written to a CSV file amount to approximately 1 GB, so the full export is on the order of 33 GB.
jdbcTemplate.query(new CustomPreparedStatementCreator(arg), new ResultSetExtractor<Void>() {
    @Override
    public Void extractData(ResultSet rs) throws SQLException {
        while (rs.next()) {
            // transform each row's data (involves creation of objects)
            // write the transformed strings to the CSV file
        }
        return null;
    }
});
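One lever that often matters here (assuming the data-virtualization driver honors it) is how the statement itself is created: a forward-only, read-only statement with a modest fetch size lets the driver stream rows in batches instead of buffering a large slice of the result in the heap. A sketch of what a statement creator could do; the SQL placeholder and fetch size are illustrative, not from the original code:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// Illustrative sketch: create the statement forward-only and read-only, and
// hint a small fetch size so rows are streamed per round trip rather than
// buffered all at once. Whether the hint is honored is driver-specific.
public class StreamingStatementCreator {
    public PreparedStatement createPreparedStatement(Connection con, String sql)
            throws SQLException {
        PreparedStatement ps = con.prepareStatement(
                sql,
                ResultSet.TYPE_FORWARD_ONLY,   // no scrolling back over fetched rows
                ResultSet.CONCUR_READ_ONLY);   // no updates through the ResultSet
        ps.setFetchSize(1000);                 // rows per fetch; tune for your driver
        return ps;
    }
}
```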
The problem is that my 8 GB heap fills up fast, so I run into java.lang.OutOfMemoryError before I get to 10 million rows.
Another limitation I have is the query read/write timeout which is set to 30 minutes.
What can I do to recycle and reuse JVM heap memory, especially the memory allocated to objects I no longer need?
I read that forcing GC to run does not guarantee memory will be reclaimed.
What are my options? Should I defer the responsibility to non-GC languages
like C/C++ via JNA or JNI to process the ResultSet?
[EDIT] It seems I am in a tough spot :D Adding more info as pointed out by @rzwitserloot
- I am reading (SELECT queries only) data from a data-virtualization tool that is hooked to a data lake.
- The data-virtualization tool's jdbc driver does support LIMIT, but the queries are designed by the business to return huge volumes of data. So I've got one shot to pull the data and generate the CSV - meaning I cannot avoid the giant SELECT or add a LIMIT clause.
- I need to check these properties: resultSetType, resultSetConcurrency, resultSetHoldability.
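Since a driver may silently downgrade whatever was requested, those three properties can be read back from the live ResultSet; a small check along these lines (the class and method names are mine):

```java
import java.sql.ResultSet;
import java.sql.SQLException;

// Illustrative check: report what the driver actually returned, since a driver
// may silently downgrade the requested type, concurrency, or holdability.
public class ResultSetInspector {
    public static void logProperties(ResultSet rs) throws SQLException {
        System.out.println("type:        " + rs.getType());        // 1003 = TYPE_FORWARD_ONLY
        System.out.println("concurrency: " + rs.getConcurrency()); // 1007 = CONCUR_READ_ONLY
        System.out.println("holdability: " + rs.getHoldability()); // 2 = CLOSE_CURSORS_AT_COMMIT
    }
}
```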
What I have already done:
First, I used a Producer-Consumer pattern to separate the jdbc fetch operations from the slow file-write operations. This helped create CSV files containing 1-5 million rows before the 30-minute timeout.
Second, I increased the number of consumer threads and had each one write to its own separate part-file, to be merged later into a single CSV file. This sped up the file writes and produced a CSV file containing 10-20 million rows before the 30-minute timeout.
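For the merge step, concatenating the part-files with streaming copies keeps the merge itself off the heap; a minimal sketch (class name is mine):

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Illustrative sketch: concatenate consumer part-files into the final CSV by
// streaming bytes, so the merge never loads an entire part-file into the heap.
public class PartFileMerger {
    public static void merge(List<Path> parts, Path target) throws IOException {
        try (OutputStream out = Files.newOutputStream(target)) {
            for (Path part : parts) {
                Files.copy(part, out); // streams bytes; constant heap per part
            }
        }
    }
}
```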
I am creating objects inside the ResultSetExtractor and passing them to the consumer threads via a bounded queue. These objects are not needed once their data has been written to the file.
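The key property of that hand-off is that a bounded queue applies back-pressure: when the consumers fall behind, the producer blocks instead of piling rows up in the heap. A minimal sketch of the idea (names, capacity, and the String[] row shape are illustrative, not the original code):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

// Minimal sketch of a bounded producer-consumer hand-off. Rows travel as plain
// String[] so they become garbage right after a consumer writes them, and the
// bounded capacity back-pressures the JDBC producer when consumers fall behind.
public class BoundedHandOff {
    static final String[] POISON = new String[0]; // end-of-stream marker

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String[]> queue = new ArrayBlockingQueue<>(10_000);
        AtomicLong written = new AtomicLong();

        Thread consumer = new Thread(() -> {
            try {
                for (String[] row = queue.take(); row != POISON; row = queue.take()) {
                    written.incrementAndGet(); // stand-in for writing row to a part-file
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        consumer.start();

        // Producer side: put() blocks while the queue is full, capping how many
        // in-flight rows the heap has to hold at any moment.
        queue.put(new String[] {"id-1", "value-1"});
        queue.put(new String[] {"id-2", "value-2"});
        queue.put(POISON);
        consumer.join();
        System.out.println("rows consumed: " + written.get());
    }
}
```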
Comments:
System.gc() has been effective for me in avoiding memory issues in situations like this, where I put if (Runtime.getRuntime().freeMemory() < someThresholdTweakedForPerformance) { System.gc(); } in the loop. Probably not the best option long term, but it worked. – Daniel Widdis
String name = resultSet.getString("Name"); - this string is going to stay there and will be referred to if we get another string from the result set with the same value. – sprkv5
System.gc(); is not a guarantee to reclaim memory - a guarantee needed for this task to be able to generate the CSV. – sprkv5