We have an Apache Spark 1.4.0 cluster and we would like to load data from a set of 350 Parquet files on HDFS. Currently, when we run our program, we get an OutOfMemoryError on the driver side. Profiling both an executor and the driver, we noticed that during the load the executor memory stays constant while the driver memory keeps increasing. For each parquet file we load data as follows:
sqlContext.read().format(PARQUET_OUT_TYPE).load("path").toJavaRDD().map(mappingFunction)
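For context, a minimal self-contained version of this per-file load looks roughly like the sketch below (PARQUET_OUT_TYPE, the HDFS path, and the mapping lambda are placeholders, not our real production values):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public class LoadSingleParquet {
    // Our own constant; it just holds the "parquet" format string.
    private static final String PARQUET_OUT_TYPE = "parquet";

    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("parquet-load-example");
        JavaSparkContext sc = new JavaSparkContext(conf);
        SQLContext sqlContext = new SQLContext(sc);

        // Load one parquet file into a DataFrame, then convert it to a
        // JavaRDD<Row> and apply a mapping function (here just a placeholder
        // lambda that reads the first column as a String).
        DataFrame df = sqlContext.read()
                .format(PARQUET_OUT_TYPE)
                .load("hdfs:///path/to/one-file.parquet"); // placeholder path
        JavaRDD<String> mapped = df.toJavaRDD().map(row -> row.getString(0));

        System.out.println("rows: " + mapped.count());
        sc.stop();
    }
}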
After that, we combine the per-file RDDs with union and then coalesce them:
partitions.reduce((r1,r2) -> r1.union(r2).coalesce(PARTITION_COUNT))
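(Again just for context: the combining step corresponds roughly to the sketch below; the helper name, the element type, and the PARTITION_COUNT value are placeholders.)

import java.util.List;
import org.apache.spark.api.java.JavaRDD;

public class CombineRdds {
    // Our own constant; 48 is only an example value.
    private static final int PARTITION_COUNT = 48;

    // Repeatedly unions the per-file RDDs, coalescing after each merge,
    // which is what the reduce call above does.
    public static <T> JavaRDD<T> combine(List<JavaRDD<T>> partitions) {
        return partitions.stream()
                .reduce((r1, r2) -> r1.union(r2).coalesce(PARTITION_COUNT))
                .orElseThrow(() -> new IllegalArgumentException("no RDDs to combine"));
    }
}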
What looks really strange to me is that the executor memory remains constant during the loading phase (when I would expect it to increase because of the data being read on that node), while the driver memory keeps increasing (when I would expect it to remain constant, since the data should not be loaded into the driver's memory).
Is there something wrong with the way we load the data? Could you please explain how to read Parquet data in parallel?
Thanks