do I need to use coalesce before saving my RDD data to file

0

votes

Imagine I have a RDD with 100 records and I partitioned it with 10, so each partition is now having 10 records I am just converting to rdd to key value pair rdd and saving it to a file now my output data is divided into 10 partitions which is ok to me, but is it best practise to use coalesce function before saving output data to file ? for example rdd.coalesce(1) this gives just one file as output does it not shuffles data insides nodes ? want to know where coalesce should be used.

Thanks

scalaapache-spark

0

votes

Avoid coalesce if you don't need it. Only use it to reduce the amount of files generated.

0

votes

As with anything, depends on your use case; coalesce() can be used to either increase or decrease the number of partitions but there is a cost associated with it.

If you are attempting to increase the number of partitions (in which the shuffle parameter must be set to true), you will incur the cost of redistributing data through a HashPartitioner. If you are attempting to decrease the number of partitions, the shuffle parameter can be set to false but the number of nodes actively grabbing from the current set of partitions will be the number of partitions you are coalescing to. For example, if you are coalescing to 1 partition, only 1 node will be active in pulling data from the parent partitions (this can be dangerous if you are coalescing a large amount of data).

Coalescing can be useful though as sometimes you can make your job run more efficiently by decreasing your partition set size (e.g. after a filter or a sparse inner join).

0

votes

you can simply use it like this

       rdd.coalesce(numberOfPartition)

It doesn't shuffle data if you decease partitions but its shuffle data if you increase partitions. Its according to use cases.But we careful to use it because if you decrease partition less than or not equal to number of cores in your cluster then its cant use full resources of your cluster. And Sometimes less shuffle data or network IO like you decrease rdd partition but equal to number of partition so its increase performance of your system.

do I need to use coalesce before saving my RDD data to file

3 Answers