1 vote

We have a ColdFusion 9 script that runs regularly reading a CSV file and inserting the records into an Oracle 11g database. The CSV file has around 50 columns, 8 of which are used by CF (the CSV format cannot be modified). The general flow of the CFM is:

  1. Read file into variable
  2. CFLOOP using that variable as the list attribute with CHR(13) as a delimiter
  3. Call an Oracle stored procedure, inserting various values from the file using ListGetAt (condensed sketch below)
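Condensed, the loop looks roughly like this (the variable, datasource, and procedure names here are illustrative, not the actual code):

```cfm
<!--- read the whole file, then loop over it as a CR-delimited list --->
<cffile action="read" file="#csvPath#" variable="csvData">

<cfloop list="#csvData#" index="line" delimiters="#chr(13)#">
    <cfstoredproc procedure="pkg_import.insert_record" datasource="#dsn#">
        <!--- 8 of the ~50 columns are passed in; the rest are ignored --->
        <cfprocparam type="in" cfsqltype="cf_sql_varchar" value="#listGetAt(line, 1)#">
        <cfprocparam type="in" cfsqltype="cf_sql_varchar" value="#listGetAt(line, 3)#">
        <!--- ...remaining cfprocparam tags follow the same pattern --->
    </cfstoredproc>
</cfloop>
```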

The stored procedure does the following:

  1. Inserts a record with 2 fields into table 1
  2. Inserts a record with 8 fields (including table 1's primary key) into table 2
  3. Returns nothing

This runs successfully in most cases, reading 400 KB files with hundreds of records in just a couple of seconds. However, occasionally we get a large volume and end up with a 13,000-record, 5 MB file. When we try to process a file that large, I watch the JVM memory usage climb from 90 MB to around 680 MB over a 10-15 second period, after which the CF Server Monitor stops responding (as does CF), forcing us to restart the service. The logs report a JVM out-of-memory error:

"Error","qtp4795249-38798","12/28/11","16:29:20",,"GC overhead limit exceeded" java.lang.OutOfMemoryError: GC overhead limit exceeded

Our JVM heap size is currently 768 MB. I haven't tried increasing it: even if that resolves this issue, it doesn't protect us in the future, and the rest of the server's normal load doesn't require nearly that much. I also hesitate to play too much with JVM settings that require a restart to take effect on a production box.

This is difficult to test as the import process runs fine with barely any noticeable memory load on my local dev machine and on our QA box, but both of those have a much slower connection to the database and take 10-15 minutes to complete.

I would appreciate any thoughts, especially about where the memory is going. I cannot figure out how 5 MB of data gets turned into 700 MB of data. We do have debugging information turned on, but the calling script's IP address is not in the debugging list, and I used a cfsetting tag to turn debugging off for this page. Previously there was a step 1.5 that turned the CSV data into a ColdFusion query, but I eliminated that in an attempt to improve efficiency. Both approaches result in an out-of-memory error.
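For reference, the per-page debugging override mentioned above is just the standard cfsetting attribute:

```cfm
<!--- disable debug output for this request only --->
<cfsetting showdebugoutput="no">
```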

Thanks for all the responses. Due to month-end and year-end processing occurring with the customer, and their response to taking down the CF server for 30 seconds yesterday, we will be waiting a week or so before trying to implement or test any of these suggestions. I will update this question then. – Nicholas

4 Answers

1 vote

Have you considered a direct import into the database? For MySQL this is LOAD DATA INFILE; for SQL Server it is BULK INSERT. If you need some additional processing, a possible approach would be loading the data into a temporary table and then handling it with CFML, which can easily be done in batches for heavy processing.
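For Oracle, SQL*Loader or an external table would play the same role. Once the data sits in a staging table, the batch handling could look roughly like this (table, column, and datasource names are made up for illustration):

```cfm
<!--- work through a staging table in fixed-size batches; all names are placeholders --->
<cfset batchSize = 1000>

<cfloop condition="true">
    <cfquery name="batch" datasource="#dsn#">
        SELECT id, col_a, col_b
        FROM   staging_csv
        WHERE  processed = 0
        AND    ROWNUM <= <cfqueryparam cfsqltype="cf_sql_integer" value="#batchSize#">
    </cfquery>

    <cfif batch.recordCount EQ 0>
        <cfbreak>
    </cfif>

    <cfloop query="batch">
        <!--- per-row work, e.g. the existing stored procedure call --->
    </cfloop>

    <cfquery datasource="#dsn#">
        UPDATE staging_csv
        SET    processed = 1
        WHERE  id IN (<cfqueryparam cfsqltype="cf_sql_integer" value="#valueList(batch.id)#" list="true">)
    </cfquery>
</cfloop>
```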

1 vote

Rather than read the whole file into memory before you start processing it, loop through the file contents by reading it one line at a time. Ben Nadel (of course) has a nice blog post discussing this technique:

Reading In File Data One Line At A Time Using ColdFusion's CFLoop Tag Or Java's LineNumberReader

Also see the livedocs for cfloop, specifically the file attribute.
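A minimal sketch of the file-attribute approach, with an illustrative path (the per-row logic stays the same as before):

```cfm
<!--- stream the CSV one line at a time instead of reading the whole file into memory --->
<cfset csvPath = "C:\imports\records.csv">

<cfloop file="#csvPath#" index="line">
    <cfif len(trim(line))>
        <!--- same per-row stored procedure call as before, e.g. listGetAt(line, 1), listGetAt(line, 3), ... --->
    </cfif>
</cfloop>
```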

0 votes

We have a CF app that imports real estate MLS listings and ran into similar issues. The core file we work with is pushing 100 MB, and reading it in and looping over it all at once created lots of problems. We ended up doing a few things:

  1. Split the file up into chunks. The import process uses the split.exe utility from Cygwin to split the file into 4,000-line chunks. We then use CFDIRECTORY to get a list of the chunks and process them one at a time.

  2. For each chunk, we read it in and then split the file contents into an array (using listToArray() with chr(13) as the delimiter).

  3. We loop from 1 to arrayLen(chunkArray) rather than looping over the file contents directly. This was done more for speed than anything else. Within that loop we also break each row into an array; we found that accessing the values as thisRow[i] (where i is the column number in the file) was much faster than repeatedly calling listGetAt(). The files we're importing have 90+ columns. (See the sketch after this list.)

  4. We increased the JVM memory allowance. Our servers are pretty busy as it is, and this was adding some overhead. We ended up pushing the JVM as high as it would go for a 32-bit server (about 2GB) so that memory was available when needed.
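A condensed sketch of steps 2 and 3, assuming a chunk has already been read into a fileContent variable (the column positions and names are invented for illustration):

```cfm
<!--- split the chunk into an array of rows, then index into each row as an array --->
<cfset rows = listToArray(fileContent, chr(13))>

<cfloop from="1" to="#arrayLen(rows)#" index="r">
    <!--- keep empty list elements so column positions stay stable --->
    <cfset thisRow = listToArray(rows[r], ",", true)>

    <!--- positional access is much cheaper than repeated listGetAt() calls --->
    <cfset listingId = thisRow[1]>
    <cfset price     = thisRow[12]>
    <!--- ...insert into the database here --->
</cfloop>
```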

0 votes

Also look at not running a query on every iteration of the loop. Much of the time spent executing queries is just making the DB connection.

Any time I have to do multiple inserts like this from a file, I build each SQL INSERT statement and append it to a variable, separated by semicolons. I then execute the accumulated statements all at once, every 100 statements.

I had to rewrite some procedures of another programmer's by doing this and was able to cut the processing time by 90%. This was in version 6, so connection sharing may have improved this since then.
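A rough sketch of that batching pattern, assuming the driver/datasource accepts multiple semicolon-separated statements in one request (table, column, and datasource names are placeholders):

```cfm
<!--- accumulate INSERT statements and flush them in batches of 100 --->
<cfset sqlBuffer = []>

<cfloop file="#csvPath#" index="line">
    <!--- note: string-built SQL skips cfqueryparam, so the input must be trusted or escaped --->
    <cfset arrayAppend(sqlBuffer,
        "INSERT INTO import_table (col_a, col_b) VALUES ('#listGetAt(line, 1)#', '#listGetAt(line, 3)#')")>

    <cfif arrayLen(sqlBuffer) GTE 100>
        <cfset batchSql = arrayToList(sqlBuffer, ";")>
        <cfquery datasource="#dsn#">#preserveSingleQuotes(batchSql)#</cfquery>
        <cfset sqlBuffer = []>
    </cfif>
</cfloop>

<!--- flush any remaining statements --->
<cfif arrayLen(sqlBuffer)>
    <cfset batchSql = arrayToList(sqlBuffer, ";")>
    <cfquery datasource="#dsn#">#preserveSingleQuotes(batchSql)#</cfquery>
</cfif>
```

On Oracle, a batch like this would typically need to be wrapped in a BEGIN ... END; anonymous block, since the JDBC driver won't execute multiple bare statements in one call.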