0
votes

I wrote some straightforward code to read large text files (> 1 GB) and do some processing on Strings.

However, I run into Java heap space problems because the Strings I append (using StringBuilder) eventually grow too big for the available memory. I know that I can increase my heap space with, e.g., '-Xmx1024m', but I would like to get by with little memory here. How could I change my code below to manage my operations?

I am still a Java novice and maybe I made some mistakes in my code which may seem obvious to you.

Here's the code snippet:

    private void setInputData() {

        Pattern pat = Pattern.compile("regex");
        BufferedReader br = null;
        Matcher mat = null;

        try {
            File myFile = new File("myFile");
            FileReader fr = new FileReader(myFile);

            br = new BufferedReader(fr);
            String line = null;
            String appendThisString = null;
            String processThisString = null;
            StringBuilder stringBuilder = new StringBuilder();

            while ((line = br.readLine()) != null) {

                mat = pat.matcher(line);

                if (mat.find()) {
                    appendThisString = mat.group(1);
                }

                if (line.contains("|")) {
                    processThisString = line.replace(" ", "").replace("|", "\t");
                    stringBuilder.append(processThisString).append("\t").append(appendThisString);
                    stringBuilder.append("\n");
                }
            }
    //      doSomethingWithTheString(stringBuilder.toString());
        } catch (Exception ex) {
            ex.printStackTrace();
        } finally {
            try {
                if (br != null) br.close();
            } catch (IOException ex) {
                ex.printStackTrace();
            }
        }
    }

Here's the error message:

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:2367)
    at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:130)
    at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:114)
    at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:415)
    at java.lang.StringBuilder.append(StringBuilder.java:132)
    at Test.setInputData(Test.java:47)
    at Test.go(Test.java:18)
    at Test.main(Test.java:13)
7
If you can split up calls to doSomethingWithTheString() so it does it in every line, that would probably help a lot. - ddmps
Well, I followed Joop Eggen's suggestion to use a database in my case. - myX.

7 Answers

1
votes

A StringBuilder is the wrong tool in this case, because it holds all of the appended data in memory. I think you should consider writing each result line to a file as soon as it is produced.

That is, use a FileWriter instead of the StringBuilder.
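A minimal sketch of that idea, reusing the loop from the question. The class name `StreamingTransform`, the method `transform`, and its parameters are made up for illustration, and the FileWriter is wrapped in a BufferedWriter to avoid many small writes:

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StreamingTransform {

    // Reads `in` line by line and writes every transformed line straight to `out`,
    // so only the current line is held in memory (hypothetical helper, not from the question).
    static void transform(File in, File out, Pattern pat) throws IOException {
        String appendThisString = null;
        try (BufferedReader br = new BufferedReader(new FileReader(in));
             BufferedWriter bw = new BufferedWriter(new FileWriter(out))) {
            String line;
            while ((line = br.readLine()) != null) {
                Matcher mat = pat.matcher(line);
                if (mat.find()) {
                    appendThisString = mat.group(1);
                }
                if (line.contains("|")) {
                    bw.write(line.replace(" ", "").replace("|", "\t"));
                    bw.write('\t');
                    // Mirrors StringBuilder.append(String): a null value is written as "null".
                    bw.write(String.valueOf(appendThisString));
                    bw.write('\n');
                }
            }
        }
    }
}
```

The try-with-resources statement closes both the reader and the writer, which replaces the finally block in the question.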

1
votes

You could do a dry run, without appending, but counting the total string length.

If doSomethingWithTheString is sequential there would be other solutions.

You could tokenize the string, reducing its size. For instance, dictionary-based compression (the LZW family; Huffman coding proper works on symbol frequencies instead) looks for already-present sequences while reading each char, possibly extends the table, and then yields a table index. (The open source OmegaT translation tool uses such a strategy at one spot for tokens.) So it depends on the processing you want to do. Since you appear to be reading a kind of CSV, a dictionary seems feasible.

In general I would use a database.

P.S. You can save half the memory by writing everything to a file and then rereading the file into one string. Or use a java.nio ByteBuffer on the file, i.e. a memory-mapped file.
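A rough sketch of the memory-mapped variant, assuming the file is smaller than 2 GB (a single mapping cannot exceed Integer.MAX_VALUE bytes). `MappedScan` and `countNewlines` are invented names, and counting newlines merely stands in for real processing:

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class MappedScan {

    // Scans the file through a read-only mapping; the bytes live in the OS page cache,
    // not in a Java heap array.
    static long countNewlines(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            long size = ch.size();
            if (size > Integer.MAX_VALUE) {
                throw new IOException("File too large for a single mapping: " + size);
            }
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, size);
            long count = 0;
            while (buf.hasRemaining()) {
                if (buf.get() == '\n') {
                    count++;
                }
            }
            return count;
        }
    }
}
```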

1
votes

The method doSomethingWithTheString() probably needs to change so that it accepts an InputStream instead. While reading the original file and transforming it line by line, write the transformed content to a temporary file, also line by line. An input stream on that temporary file can then be passed to the method, which should probably be renamed to doSomethingWithInputStream().
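Something along these lines, as a sketch only. What doSomethingWithInputStream() really does is unknown, so the version here just counts lines as a placeholder. `TempFileHandOff` and `process` are hypothetical names, UTF-8 is an assumed encoding, and the regex part of the question is left out for brevity:

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class TempFileHandOff {

    // Placeholder for the real processing: consumes the stream and counts its lines.
    static long doSomethingWithInputStream(InputStream in) throws IOException {
        long lines = 0;
        BufferedReader reader = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
        while (reader.readLine() != null) {
            lines++;
        }
        return lines;
    }

    // Transforms `source` into a temporary file, then hands a stream on that file to the consumer.
    static long process(Path source) throws IOException {
        Path tmp = Files.createTempFile("transformed", ".tmp");
        try {
            try (BufferedReader br = Files.newBufferedReader(source, StandardCharsets.UTF_8);
                 BufferedWriter bw = Files.newBufferedWriter(tmp, StandardCharsets.UTF_8)) {
                String line;
                while ((line = br.readLine()) != null) {
                    if (line.contains("|")) {
                        bw.write(line.replace(" ", "").replace("|", "\t"));
                        bw.newLine();
                    }
                }
            }
            try (InputStream in = Files.newInputStream(tmp)) {
                return doSomethingWithInputStream(in);
            }
        } finally {
            Files.deleteIfExists(tmp); // clean up the temporary file either way
        }
    }
}
```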

1
votes

From your example it is not clear what you are going to do with your enormous string once you have modified it. However, since your modifications do not appear to span multiple lines, I'd just write the modified data to a new file.

To do that, create and open a new FileWriter object before your while loop, move your stringBuilder declaration to the beginning of the loop body, and write stringBuilder to your new file at the end of each iteration.

If, on the other hand, you do need to combine data coming from different lines, consider using a database. Which kind depends on the nature of your data. If it has a record-like organization you might adopt a relational database, such as Apache Derby or MySQL; otherwise you might check out so-called NoSQL databases, such as Cassandra or MongoDB.

1
votes

The general strategy is to design your application so that it doesn't need to hold the entire file (or too large a proportion of it) in memory.

Depending on what your application does:

  • You could write the intermediate data to a file and read it back again a line at a time to process it.
  • You could pass each line read to the processing algorithm; e.g. by calling doSomethingWithTheString(...) on each line individually rather than all of them.
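The second option could look roughly like this. The class `PerLineProcessing`, the method `forEachTransformedLine`, and the use of a `Consumer<String>` callback are illustrative choices, not something taken from the question:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.function.Consumer;

class PerLineProcessing {

    // Hands every transformed line to the callback as soon as it is read,
    // instead of accumulating all lines in one big string.
    static void forEachTransformedLine(Reader source, Consumer<String> doSomethingWithTheLine)
            throws IOException {
        try (BufferedReader br = new BufferedReader(source)) {
            String line;
            while ((line = br.readLine()) != null) {
                if (line.contains("|")) {
                    doSomethingWithTheLine.accept(line.replace(" ", "").replace("|", "\t"));
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // A StringReader stands in for the FileReader on the real input file.
        forEachTransformedLine(new StringReader("a | b\nskip me\nc|d\n"), System.out::println);
    }
}
```

This only works if the processing does not need to see several lines at once.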

But if you need to have the entire file in memory, you are between a rock and a hard place.


The other thing to note is that using a StringBuilder like that may require up to 6 times as much memory as the file size. It goes like this.

  • When the StringBuilder needs to expand its internal buffer it does this by making a char array twice the size of the current buffer, and copying from the old to the new. At that point you have 3 times as much buffer space allocated as you had before the buffer expansion started. Now suppose that there was just one more character to append to the buffer.

  • If the file is in ASCII (or another 8 bit charset), the StringBuilder's buffer needs twice that amount of memory ... because it consists of char not byte values.

If you have a good estimate of the number of characters that will be in the final string (e.g. from the file size), you can avoid the x3 multiplier by giving a capacity hint when you create the StringBuilder. However, you mustn't underestimate, 'cos if you underestimate just slightly ...
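For the simple case of reading a file unchanged, the file size in bytes is a safe upper bound on the number of chars, so a capacity hint might look like the sketch below. `PresizedBuilder` and `readAll` are made-up names. Note that this bound does not hold for the transformation in the question, which appends an extra column to each line; there you would need a separate estimate.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;

class PresizedBuilder {

    // Reads the whole file into a StringBuilder that is sized up front,
    // so the internal buffer never has to be doubled and copied.
    static StringBuilder readAll(File file) throws IOException {
        long size = file.length();
        if (size > Integer.MAX_VALUE - 8) {
            throw new IOException("File too large for a single StringBuilder: " + size);
        }
        // Capacity hint: one char per byte is an upper bound, plus one spare slot.
        StringBuilder sb = new StringBuilder((int) size + 1);
        try (Reader r = new BufferedReader(new FileReader(file))) {
            char[] chunk = new char[8192];
            int n;
            while ((n = r.read(chunk)) != -1) {
                sb.append(chunk, 0, n);
            }
        }
        return sb;
    }
}
```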

You could also use a byte-oriented buffer (e.g. a ByteArrayOutputStream) instead of a StringBuilder ... and then read it with a ByteArrayInputStream / InputStreamReader / BufferedReader pipeline.
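As a sketch, assuming the text fits an 8-bit charset such as ISO-8859-1 (otherwise the saving shrinks). `ByteBufferedLines`, `collect`, and `countLines` are hypothetical names, and counting lines stands in for the real processing:

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

class ByteBufferedLines {

    // Accumulates the lines as bytes (one byte per char in ISO-8859-1) rather than as chars.
    static byte[] collect(Iterable<String> lines) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(bos, StandardCharsets.ISO_8859_1)) {
            for (String line : lines) {
                w.write(line);
                w.write('\n');
            }
        }
        return bos.toByteArray(); // note: toByteArray() makes one more copy of the data
    }

    // Reads the buffered bytes back line by line through the reader pipeline.
    static int countLines(byte[] data) throws IOException {
        int count = 0;
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(data), StandardCharsets.ISO_8859_1))) {
            while (r.readLine() != null) {
                count++;
            }
        }
        return count;
    }
}
```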

But ultimately, holding a large file in memory doesn't scale as the file size increases.

0
votes

Are you sure there is a line terminator in the file? If there is none, readLine() will keep reading until it has buffered the whole file as a single line, which can lead to this kind of error. In that case, it might be worth reading a fixed number of characters at a time so that the reader's buffer won't grow without bound.

0
votes

I suggest using Guava's FileBackedOutputStream. You gain the advantage of an OutputStream that uses disk I/O instead of main memory. Of course access will be slower due to the disk I/O, but if you are dealing with such a large stream and are unable to chunk it into a more manageable size, it is a good option.