I am running Hive 1.2 on Hadoop 2.6, and I have loaded a 21 GB Parquet table into HDFS with a replication factor of 1, spread across 3 nodes. I am running a simple selection query that returns no rows (mainly to measure the performance of a full table scan):
select * from myParquetTable where id < 0;
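For completeness, the table was created and loaded along these lines (the schema is simplified and the staging table name is just a stand-in, not my real setup):

CREATE TABLE myParquetTable (
  id BIGINT
  -- remaining columns omitted; the real table is wider
)
STORED AS PARQUET;

-- loaded from an existing staging table (hypothetical name):
INSERT OVERWRITE TABLE myParquetTable
SELECT * FROM my_staging_table;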
But I keep getting Java heap space errors from "ParquetFileReader" (close to the end of the map-only job):
java.lang.OutOfMemoryError: Java heap space
    at parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:755)
    at parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:494)
    at parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:127)
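For what it's worth, I understand the map-task heap in this setup is governed by the settings below; the values shown are illustrative placeholders, not my actual configuration:

-- checked/set in the Hive session before running the query
set mapreduce.map.memory.mb=4096;        -- YARN container size for each map task
set mapreduce.map.java.opts=-Xmx3276m;   -- JVM heap inside that container (roughly 80% of the container)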
While the total size of the data is 21 GB, I have a total of 31.5 GB of memory available across the 3 nodes. I am wondering whether Parquet files are inherently memory-hungry and need a large amount of memory even for simple full scans, or whether something else is missing here. (I am fairly new to Parquet; my previous experience with the ORC format, with even larger data sizes on the same hardware, was successful.)
Any suggestions or hints would be appreciated.