0 votes

I am trying to use the code below to download and read data from a file, but it goes OOM, exactly while reading the file. The S3 file is 22 MB; when I download it through the browser it is 650 MB on disk; but when I monitor it through VisualVM, the memory consumed while uncompressing and reading is more than 2 GB. Can anyone please guide me so that I can find the reason for the high memory usage? Thanks.

public static String unzip(InputStream in) throws IOException, CompressorException, ArchiveException {
    System.out.println("Unzipping.............");
    GZIPInputStream gzis = null;
    try {
        gzis = new GZIPInputStream(in);
        InputStreamReader reader = new InputStreamReader(gzis);
        BufferedReader br = new BufferedReader(reader);
        double mb = 0;
        String readed;
        int i = 0;
        while ((readed = br.readLine()) != null) {
            mb = mb + readed.getBytes().length / (1024 * 1024);
            i++;
            if (i % 100 == 0) {
                System.out.println(mb);
            }
        }
    } catch (IOException e) {
        e.printStackTrace();
        LOG.error("Invoked AWSUtils getS3Content : json ", e);
    } finally {
        closeStreams(gzis, in);
    }
    // ... rest of the method (building and returning the result) omitted
}

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:3332)
    at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
    at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:596)
    at java.lang.StringBuffer.append(StringBuffer.java:367)
    at java.io.BufferedReader.readLine(BufferedReader.java:370)
    at java.io.BufferedReader.readLine(BufferedReader.java:389)
    at com.kpmg.rrf.utils.AWSUtils.unzip(AWSUtils.java:917)

[VisualVM monitoring graph showing heap usage]

Please edit your question to include the actual exception that you're getting, including the stack trace. Indicate which line of the code that you posted is throwing the exception. – Kenster
Are you saying the unzipped file is 650 MB, and your VM uses 2 GB before running OOM? – AndyMan
Thank you Kenster, added more info to the question. – Aadam
@AndyMan The JVM uses more than 2 GB. To cross-check, I downloaded the file from S3 through the browser (where the file size is 22 MB); after downloading it was around 650 MB on disk. – Aadam
Is this the real code causing the problems? It seems like something is missing from it, namely the code that actually does something with the data you are reading in. All you posted here is some logic that counts the number of megabytes. – Gimby

2 Answers

1 vote

This is a theory, but I can't think of any other reasons why your example would OOM.

Suppose that the uncompressed file contains a very long line; e.g. something like 650 million ASCII bytes.

Your application seems to just read the file a line at a time and (try to) display a running total of the megabytes that have been read.

Internally, the readLine() method reads characters one at a time and appends them to a StringBuffer. (You can see the append call in the stack trace.) If the file consists of a very large line, then the StringBuffer is going to get very large.

  • Each text character in the uncompressed string becomes a char in the char[] that is the buffer part of the StringBuffer.

  • Each time the buffer fills up, StringBuffer will grow the buffer by (I think) doubling its size. This entails allocating a new char[] and copying the characters to it.

  • So if the buffer fills when there are N characters, Arrays.copyOf will allocate a char[] to hold 2 x N characters. And while the data is being copied, a total of 3 x N characters of storage will be in use.

  • So 650MB of text could easily turn into a heap demand of > 6 x 650M bytes (3 x N characters in use, and each char takes 2 bytes).

The other thing to note is that the 2 x N array has to be a single contiguous heap node.

Looking at the heap graphs, it looks like the heap got to ~1GB in use. If my theory is correct, the next allocation would have been for a ~2GB node. But 1GB + 2GB is right on the limit for your 3.1GB heap max. And when we take the contiguity requirement into account, the allocation cannot be done.
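To make the theory concrete, here is a minimal, self-contained sketch (not from the original post; the class name and sizes are made up) that manufactures a gzip stream containing one enormous line and then reads it the same way the question's code does. Run it with a small heap, e.g. -Xmx256m, and it should fail inside readLine() with the same kind of stack trace:

import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.Arrays;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class LongLineRepro {
    public static void main(String[] args) throws IOException {
        // Build a gzipped payload that is one single line of 'a' characters.
        // 128 MB of identical bytes compresses to well under 1 MB, but readLine()
        // still has to buffer all 128M chars (256+ MB of char[], and more while
        // the buffer is being doubled and copied).
        byte[] chunk = new byte[1024 * 1024];
        Arrays.fill(chunk, (byte) 'a');
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            for (int i = 0; i < 128; i++) {
                gz.write(chunk);            // no '\n' anywhere: one huge line
            }
        }
        System.out.println("compressed size: " + bos.size() + " bytes");

        // Read it back the way the question does; with a small heap this OOMs in readLine().
        try (BufferedReader br = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(new ByteArrayInputStream(bos.toByteArray()))))) {
            String line = br.readLine();
            System.out.println("line length: " + line.length());
        }
    }
}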


So what is the solution?

It is simple really: don't use readLine() if it is possible for lines to be unreasonably long.

    public static String unzip(InputStream in)
            throws IOException, CompressorException, ArchiveException {
        System.out.println("Unzipping.............");
        try (
            GZIPInputStream gzis = new GZIPInputStream(in);
            InputStreamReader reader = new InputStreamReader(gzis);
            BufferedReader br = new BufferedReader(reader)
        ) {
            int ch;
            long i = 0;
            while ((ch = br.read()) >= 0) {
                 i++;
                 if (i % (100 * 1024 * 1024) == 0) {
                     System.out.println(i / (1024 * 1024));
                 }
            }
        } catch (IOException e) {
            e.printStackTrace();
            LOG.error("Invoked AWSUtils getS3Content : json ", e);
        }
        return null;   // this example only counts characters; nothing useful to return
    }
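For completeness, a hypothetical caller could look like this (the file path is made up); the point is that the whole stream is consumed with only a small, fixed amount of memory, no matter how long the lines are:

    // Hypothetical usage; only the decoder's internal buffers are held in memory.
    try (InputStream in = new FileInputStream("/tmp/data.json.gz")) {
        unzip(in);
    }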
0 votes

I also thought of the too-long line. On second thought, I think the StringBuffer that is used internally by the JVM needs to be converted to the result type of readLine(): a String. Strings are immutable, but for speed reasons the JVM would not even check whether a line is a duplicate. So it may allocate the String many times, ultimately filling up the heap with no-longer-used String fragments.

My recommendation would be not to read lines or characters, but chunks of bytes. A byte[] is allocated on the heap and can be thrown away afterwards. Of course you would then count bytes instead of characters. Unless you know the difference and really need characters, this could be the more stable and performant solution.

This code is just written from memory and not tested:

    public static String unzip(InputStream in)
            throws IOException, CompressorException, ArchiveException {
        System.out.println("Unzipping.............");
        try (GZIPInputStream gzis = new GZIPInputStream(in)) {
            byte[] buffer = new byte[8192];
            long total = 0;
            long nextReport = 100 * 1024 * 1024;   // report roughly every 100 MB read
            int read = gzis.read(buffer);
            while (read >= 0) {
                 total += read;
                 if (total >= nextReport) {
                     System.out.println(total / (1024 * 1024));
                     nextReport += 100 * 1024 * 1024;
                 }
                 read = gzis.read(buffer);
            }
        } catch (IOException e) {
            e.printStackTrace();
            LOG.error("Invoked AWSUtils getS3Content : json ", e);
        }
        return null;   // this example only counts bytes; nothing useful to return
    }