We have an Apache Spark 1.4.0 cluster and we would like to load data from a set of 350 Parquet files on HDFS. Currently, when we run our program, we get an OutOfMemoryError on the driver side. Profiling both an executor and the driver, we noticed that during the load the executor memory stays constant while the driver memory keeps increasing. For each parquet file we load data as follows:
sqlContext.read().format(PARQUET_OUT_TYPE).load("path").toJavaRDD().map(mappingFunction)
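For context, a minimal self-contained version of this per-file load looks roughly like the sketch below (PARQUET_OUT_TYPE, the HDFS path, and the mapping lambda are placeholders, not our real production values):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public class LoadSingleParquet {
    // Our own constant; it just holds the "parquet" format string.
    private static final String PARQUET_OUT_TYPE = "parquet";

    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("parquet-load-example");
        JavaSparkContext sc = new JavaSparkContext(conf);
        SQLContext sqlContext = new SQLContext(sc);

        // Load one parquet file into a DataFrame, then convert it to a
        // JavaRDD<Row> and apply a mapping function (here just a placeholder
        // lambda that reads the first column as a String).
        DataFrame df = sqlContext.read()
                .format(PARQUET_OUT_TYPE)
                .load("hdfs:///path/to/one-file.parquet"); // placeholder path
        JavaRDD<String> mapped = df.toJavaRDD().map(row -> row.getString(0));

        System.out.println("rows: " + mapped.count());
        sc.stop();
    }
}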
After that, we combine the per-file RDDs with union and then coalesce them:
partitions.reduce((r1,r2) -> r1.union(r2).coalesce(PARTITION_COUNT))
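(Again just for context: the combining step corresponds roughly to the sketch below; the helper name, the element type, and the PARTITION_COUNT value are placeholders.)

import java.util.List;
import org.apache.spark.api.java.JavaRDD;

public class CombineRdds {
    // Our own constant; 48 is only an example value.
    private static final int PARTITION_COUNT = 48;

    // Repeatedly unions the per-file RDDs, coalescing after each merge,
    // which is what the reduce call above does.
    public static <T> JavaRDD<T> combine(List<JavaRDD<T>> partitions) {
        return partitions.stream()
                .reduce((r1, r2) -> r1.union(r2).coalesce(PARTITION_COUNT))
                .orElseThrow(() -> new IllegalArgumentException("no RDDs to combine"));
    }
}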
What looks really strange to me is that the executor memory remains constant during the loading phase (when I would expect it to increase because of the data being read on that node), while the driver memory keeps increasing (when I would expect it to remain constant, since the data should not be loaded into the driver's memory).
Is there something wrong with the way we load the data? Could you please explain how to read Parquet data in parallel?
Thanks