I am running Hive 1.2 on Hadoop 2.6, and I have loaded a 21 GB Parquet table into HDFS with a replication factor of 1, spread across 3 nodes. I am running a simple selection query that returns no rows (mainly to measure the performance of a full table scan):
select * from myParquetTable where id < 0;
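For completeness, the table was created and loaded along these lines (the schema is simplified and the staging table name is just a stand-in, not my real setup):

CREATE TABLE myParquetTable (
  id BIGINT
  -- remaining columns omitted; the real table is wider
)
STORED AS PARQUET;

-- loaded from an existing staging table (hypothetical name):
INSERT OVERWRITE TABLE myParquetTable
SELECT * FROM my_staging_table;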
But I keep getting Java heap space errors from "ParquetFileReader" (close to the end of the map-only job):
java.lang.OutOfMemoryError: Java heap space
    at parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:755)
    at parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:494)
    at parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:127)
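For what it's worth, I understand the map-task heap in this setup is governed by the settings below; the values shown are illustrative placeholders, not my actual configuration:

-- checked/set in the Hive session before running the query
set mapreduce.map.memory.mb=4096;        -- YARN container size for each map task
set mapreduce.map.java.opts=-Xmx3276m;   -- JVM heap inside that container (roughly 80% of the container)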
While the total size of the data is 21 GB, I have a total of 31.5 GB of memory available across the 3 nodes. I am wondering whether Parquet files are inherently memory-hungry and need a large amount of memory even for simple full scans, or whether something else is missing here. (I am fairly new to Parquet; my previous experience with the ORC format, with even larger data sizes on the same hardware, was successful.)
Any suggestions or hints would be appreciated.