0 votes

I need to write ~50 million rows fetched from a JDBC ResultSet to a CSV file.
1.5 million rows written to a CSV file amount to approximately 1 GB.

jdbcTemplate.query(new CustomPreparedStatementCreator(arg), new ResultSetExtractor<Void>() {
    @Override
    public Void extractData(ResultSet rs) throws SQLException {
        while (rs.next()) {
            // transform each row's data (involves creation of objects)
            // write the transformed strings to csv file
        }
        return null;
    }
});

The problem is I have a heap of 8 GB and it gets filled up pretty fast.
Hence I run into java.lang.OutOfMemoryError before I get to 10 million rows.
Another limitation I have is the query read/write timeout which is set to 30 minutes.

What can I do to recycle and reuse the JVM heap memory, especially the memory allocated for objects that I don't need anymore?

I read that forcing GC to run does not guarantee memory will be reclaimed.

What are my options? Should I defer the responsibility to non-GC languages
like C or C++ via JNA or JNI to process the ResultSet?

[EDIT] It seems I am in a tough spot :D Adding more info as pointed out by @rzwitserloot:

  1. I am reading (SELECT queries only) data from a data-virtualization tool that is hooked to a data lake.
  2. The data-virtualization tool's JDBC driver does support LIMIT, but the queries are designed by the business to return huge volumes of data. So I've got one shot to pull the data and generate a CSV - meaning, I cannot avoid the giant SELECT or add a LIMIT clause.
  3. I need to check these properties: resultSetType, resultSetConcurrency, resultSetHoldability.

What I have already done:

First, I used a Producer-Consumer pattern to separate the JDBC fetch operations from the slow file write operations. This helped create CSV files containing 1-5 million rows before the 30-minute timeout.

Second, I increased the number of consumer threads and had them write to their own separate part-files, to be merged later into a single CSV file. This sped up file writes and produced a CSV file containing 10-20 million rows before the 30-minute timeout.

I am creating objects inside the ResultSetExtractor and passing them to consumer threads via a bounded queue. These objects are not needed once the data from them is written to the file.
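Roughly, the hand-off looks like the sketch below. This is a minimal, self-contained illustration, not my actual code: the class name PartFileWriters, the END_OF_STREAM sentinel, the queue size, and the stand-in row strings are all made up for the example, and the real producer transforms ResultSet rows instead of pushing canned lines.

import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PartFileWriters {

    // Sentinel value telling a consumer to stop (illustrative).
    private static final String END_OF_STREAM = "__END_OF_STREAM__";
    private static final int CONSUMERS = 4;

    public static void main(String[] args) throws Exception {
        // Bounded queue: the producer blocks instead of filling the heap.
        BlockingQueue<String> rowQueue = new ArrayBlockingQueue<>(10_000);
        ExecutorService pool = Executors.newFixedThreadPool(CONSUMERS);

        for (int i = 0; i < CONSUMERS; i++) {
            Path partFile = Paths.get("output.part" + i + ".csv");
            pool.submit(() -> {
                try (BufferedWriter writer = Files.newBufferedWriter(partFile)) {
                    String line;
                    while (!(line = rowQueue.take()).equals(END_OF_STREAM)) {
                        writer.write(line);
                        writer.newLine();
                    }
                } catch (IOException | InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        // Producer side (normally inside extractData(ResultSet rs)):
        // while (rs.next()) { rowQueue.put(toCsvLine(rs)); }  // put() blocks when the queue is full
        for (int i = 0; i < 5; i++) {
            rowQueue.put("col1,col2,col3"); // stand-in for transformed rows
        }
        for (int i = 0; i < CONSUMERS; i++) {
            rowQueue.put(END_OF_STREAM); // one sentinel per consumer
        }
        pool.shutdown();
    }
}

The bounded queue is the part that keeps the heap flat: when the consumers fall behind, put() blocks the JDBC producer instead of letting untransformed rows pile up in memory.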

Have you considered the use of RowCallbackHandler? ( mkyong.com/spring/spring-jdbctemplate-handle-large-resultset ) – DaveH
I've noticed GC can be too slow to release memory if you're chewing up heap quickly. "Forcing" GC to run doesn't guarantee it, but System.gc() has been effective for me in avoiding memory issues in situations like this, where I put this in the loop: if (Runtime.getRuntime().freeMemory() < someThresholdTweakedForPerformance) { System.gc(); }. Probably not the best option long term, but it worked. – Daniel Widdis
@DaveH I am not sure that would help. If you look at the example, it creates a String object on the heap: String name = resultSet.getString("Name"); This is going to stay there and will be referred to if we get another string from the result set with the same value. – sprkv5
@DanielWiddis, I am actually chewing up heap pretty fast, that's true. And this breaks down at around 6-8 million rows, since there are other tasks running on the same JVM instance. So calling System.gc(); is not a guarantee that memory is reclaimed - a guarantee I need for this task to be able to generate the CSV. – sprkv5
Is it really the case that the string will stay in the heap? Declare the variable outside the loop and reassign it inside. The "old" string will be immediately available for garbage collection as it will have no active references to it. Isn't your problem mainly that the ResultSetExtractor is building a list with 50 million rows in it and then processing it sequentially? RowCallbackHandler will process each row as it is returned from the ResultSet. – DaveH
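For context, DaveH's RowCallbackHandler suggestion would look roughly like this. It is a sketch, not the asker's code: it assumes an existing jdbcTemplate and a BufferedWriter named writer, and the query and column name are placeholders.

// Each row is handed to the callback as it is read; nothing is accumulated in a list.
jdbcTemplate.query("SELECT name FROM some_table", (RowCallbackHandler) rs -> {
    try {
        writer.write(rs.getString("name"));   // getString may throw SQLException, which processRow declares
        writer.newLine();
    } catch (IOException e) {
        throw new UncheckedIOException(e);    // processRow cannot throw IOException directly
    }
});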

1 Answer

2 votes

You've pasted very little code; one of the key clues is that, by design, the code you pasted has no memory issues. ResultSet is intentionally designed to be cursor-esque, meaning that, in theory, every .next() call results in TCP/IP traffic asking the DB to fetch another row. This is why result sets need to be closed: the database is maintaining a separate 'version' so that (assuming you're using serializable or some other clean-reads isolation level) any other transaction that was started, or rather wasn't yet committed, when you opened the one you are in* doesn't have any effect on the data you are witnessing as you go through the .next() calls.

Now, the JDBC API is also quite flexible. All that per-row traffic is a lot of packets and work, so in practice many JDBC drivers will either just send all the data at once (and ResultSet.close() does nothing), or will at least send things in larger batches, so that most .next() calls result in no DB traffic except for, say, every 100th call.

Thus, we have 2 major options here:

  1. The memory leak has nothing whatsoever to do with what you pasted; for example, you're writing your CSV data into an ever-growing buffer and not streaming it to disk at all. Triple-check this: replace your giant SELECT with a LIMIT-ed one and add a giant for loop around it, so you simulate writing a ton of records without actually querying much from your JDBC loop (see the sketch just after this list). If that still runs out of memory, it's not your DB layer.

  2. The JDBC driver is nevertheless implementing its ResultSet implementation with something that continually takes memory.
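That check could be run with something like the following. This is only a sketch: the LIMIT syntax depends on your DB, and toCsvLine / writeCsvLines are placeholders for whatever your real transform and writer code are.

// Run a small LIMIT-ed query many times and push the results through the exact same
// transform-and-write path, so ~50 million rows get written while the DB only ever
// hands back a small batch. If this still exhausts the heap, the problem is in the
// CSV-writing side, not the JDBC layer.
for (int i = 0; i < 50_000; i++) {
    List<String> lines = jdbcTemplate.query(
            "SELECT ... FROM big_table LIMIT 1000",
            (rs, rowNum) -> toCsvLine(rs));   // toCsvLine: hypothetical row transform
    writeCsvLines(lines);                     // hypothetical: same writer used for the real run
}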

IF it is #2, then you have 2 solutions:

  1. Make the DB engine not do that. ResultSets have 'features', and you ask for the feature(s) you need as you create them. For example, you can tell the system you want the ResultSet to be so-called 'forward only'. The 3 properties most likely to result in non-memory-chewing ResultSets are resultSetType = FORWARD_ONLY, resultSetConcurrency = CONCUR_READ_ONLY, and resultSetHoldability = CLOSE_CURSORS_AT_COMMIT. I don't actually know how to tell JdbcTemplate to do this, but it shouldn't be too difficult - JdbcTemplate is calling java.sql.Connection's prepareStatement method - make sure it calls the overload where you set all those properties to those values (a sketch follows this list).

  2. If that doesn't work, work around it by using OFFSET/LIMIT (the syntax for this unfortunately depends on your DB engine) to fetch a page at a time (also sketched below). Of course, if the table is being edited while you are doing this, then unless you have the serializable isolation level set up, that's going to mess with your results, and you MUST add an ORDER BY clause of some sort or you don't get an actual guarantee that results are returned in the same order (and without that, OFFSET/LIMIT paging isn't going to do what you want). That's a bit bizarre - what kind of exotic, third-rate, badly written DB engine and/or JDBC driver are you using, if this is happening to you?
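For the first solution, a PreparedStatementCreator along these lines should do it. This is a sketch: the class name and the fetch size of 1,000 are illustrative, and whether the driver actually honours the fetch-size hint is driver-specific.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

import org.springframework.jdbc.core.PreparedStatementCreator;

// Illustrative name; pass an instance of this to jdbcTemplate.query(...)
// in place of CustomPreparedStatementCreator.
public class ForwardOnlyStatementCreator implements PreparedStatementCreator {

    private final String sql;

    public ForwardOnlyStatementCreator(String sql) {
        this.sql = sql;
    }

    @Override
    public PreparedStatement createPreparedStatement(Connection con) throws SQLException {
        // Ask for a cursor that only moves forward, is read-only,
        // and is closed when the transaction ends.
        PreparedStatement ps = con.prepareStatement(
                sql,
                ResultSet.TYPE_FORWARD_ONLY,
                ResultSet.CONCUR_READ_ONLY,
                ResultSet.CLOSE_CURSORS_AT_COMMIT);
        // Hint to fetch rows from the server in batches rather than all at once.
        ps.setFetchSize(1_000);
        return ps;
    }
}

Note that some drivers only honour the fetch size under extra conditions (PostgreSQL, for instance, wants auto-commit off before it will stream rows), so check your driver's documentation.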
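For the second solution, the paging loop might look roughly like this. Again a sketch: the SQL syntax, the ordering column id, the page size, and the toCsvLine / writeCsvLines helpers are all placeholders.

int pageSize = 100_000;
for (long offset = 0; ; offset += pageSize) {
    // ORDER BY a stable key is mandatory, or pages may overlap or skip rows.
    List<String> lines = jdbcTemplate.query(
            "SELECT ... FROM big_table ORDER BY id LIMIT ? OFFSET ?",
            (rs, rowNum) -> toCsvLine(rs),   // RowMapper<String>; toCsvLine is hypothetical
            pageSize, offset);
    if (lines.isEmpty()) {
        break;                               // past the last page
    }
    writeCsvLines(lines);                    // hypothetical append-to-CSV helper
}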

*) "But, I'm not using transactions!" - yes, you are. 'Auto commit = true' is what is commonly known as 'no transactions', but that is not true; it's simply one transaction per SQL statement you send. The only truly 'no transactions' modes are things your DB explicitly says exist outside of transactions: not-actually-safe DBs like MySQL with the MyISAM table type (which isn't really a DB in the first place), or intentionally lenient isolation levels.