4
votes

I have Hadoop and Spark installed on 3 nodes. I would like to load data from an RDBMS into a DataFrame and write that data as Parquet on HDFS. The "dfs.replication" value is 1.

When I try this with the following command, I see that all the HDFS blocks are located on the node where I ran spark-shell:

scala> xfact.write.parquet("hdfs://sparknode01.localdomain:9000/xfact")
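For context, a DataFrame like xfact would typically be loaded from the RDBMS over JDBC. A minimal sketch of that read, using the sqlContext available in spark-shell; the JDBC URL, table name, and credentials below are placeholder assumptions, not the actual values used here:

// Sketch: read a table from an RDBMS over JDBC into a DataFrame
val xfact = sqlContext.read
  .format("jdbc")
  .option("url", "jdbc:mysql://dbhost:3306/mydb")   // placeholder JDBC URL
  .option("dbtable", "xfact")                       // placeholder source table
  .option("user", "dbuser")                         // placeholder credentials
  .option("password", "dbpass")
  .load()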

Is this the intended behaviour or should all blocks be distributed across the cluster?

Thanks


2 Answers

3
votes

Since you are writing your data to HDFS, this does not depend on Spark but on HDFS. From Hadoop: The Definitive Guide:

Hadoop’s default strategy is to place the first replica on the same node as the client (for clients running outside the cluster, a node is chosen at random, although the system tries not to pick nodes that are too full or too busy).

So yes, this is the intended behaviour.
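If you want to confirm where the blocks actually ended up, you can ask HDFS directly. A quick check, run from any node with the Hadoop client configured (/xfact being the output path from the question):

hdfs fsck /xfact -files -blocks -locations

With dfs.replication set to 1, each block should be reported on the single DataNode of the machine where spark-shell was running.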

0
votes

Just as @nik says, I do my work with multiple clients and it worked for me.

This is the Python snippet:

columns = xfact.columns
test = sqlContext.createDataFrame(xfact.rdd.map(lambda a: a), columns)
test.write.mode('overwrite').parquet('hdfs://sparknode01.localdomain:9000/xfact')