
I'm trying to understand: what is a partition in Spark?

My understanding is: when we read from a source into a specific Dataset, that Dataset can be split into multiple sub-Datasets, and those sub-Datasets are called partitions. It's up to the Spark framework where and how it distributes them in the cluster. Is that correct?

I became doubtful when I read some online articles, which say:

Under the hood, these RDDs or Datasets are stored in partitions on different cluster nodes. Partition basically is a logical chunk of a large distributed data set

This statement breaks my understanding. As per the statement above, RDDs or Datasets sit inside partitions. But I was thinking of the RDD itself as a partition (after splitting).

Can anyone help me clear up this doubt?

Here is my code snippet, where I am reading from JSON:

Dataset<Row> ds = spark.read().schema(Jsonreadystructure.SCHEMA)
                .json(JsonPath);

So while reading itself, how can I split this into multiple partitions? Or is there another way around it?


1 Answer


What is a partition?

As per the Spark documentation, a partition in Spark is an atomic chunk of data (a logical division of data) stored on a node in the cluster. Partitions are the basic units of parallelism in Apache Spark. An RDD/DataFrame/Dataset in Apache Spark is a collection of partitions.

So, when you do

Dataset<Row> ds = spark.read().schema(Jsonreadystructure.SCHEMA)
                .json(JsonPath);

Spark reads your source JSON data, creates logical divisions of that data (which are the partitions), and then processes those partitions in parallel across the cluster.

For example, in layman's terms: suppose you have the task of moving a 1-ton load of wheat from one place to another, and you have only one worker (similar to a single thread) to do it. There are several possibilities here:

1) Your worker might not be able to move such a huge weight at once (similar to not having enough CPU or RAM).

2) If he is capable (similar to a high-spec machine), it might still take a very long time and wear him out.

3) Your worker can't take on any other work while he is doing the transfer. And so on...

What if you divide the 1-ton load of wheat into 1 kg blocks (similar to logical partitions of the data) and hire more workers? Now the job is a lot easier for them, you can add a few more workers (similar to scaling up the cluster), and you can finish the actual task easily and fast.

Similar to the above approach, Spark does a logical division of the data so that you can process it in parallel, use your cluster resources optimally, and finish your task much faster.
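The idea above can be sketched in plain Java, without Spark: split a dataset into chunks (the "partitions") and process each chunk on its own thread, the way Spark assigns one task per partition. The class and method names here are hypothetical, just for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class PartitionDemo {

    // Split a list into `numPartitions` roughly equal chunks,
    // mimicking how Spark divides a dataset into partitions.
    static <T> List<List<T>> partition(List<T> data, int numPartitions) {
        List<List<T>> parts = new ArrayList<>();
        int size = (int) Math.ceil((double) data.size() / numPartitions);
        for (int i = 0; i < data.size(); i += size) {
            parts.add(data.subList(i, Math.min(i + size, data.size())));
        }
        return parts;
    }

    public static void main(String[] args) throws Exception {
        // The "1-ton load": the numbers 1..1000.
        List<Integer> data =
            IntStream.rangeClosed(1, 1000).boxed().collect(Collectors.toList());

        // Divide into 4 partitions and sum each one on its own thread,
        // like Spark running one task per partition.
        List<List<Integer>> partitions = partition(data, 4);
        ExecutorService pool = Executors.newFixedThreadPool(partitions.size());
        List<Future<Long>> results = new ArrayList<>();
        for (List<Integer> p : partitions) {
            results.add(pool.submit(
                () -> p.stream().mapToLong(Integer::longValue).sum()));
        }

        // Combine the per-partition results into the final answer.
        long total = 0;
        for (Future<Long> f : results) {
            total += f.get();
        }
        pool.shutdown();

        System.out.println(total); // 500500, the sum of 1..1000
    }
}
```

Each partition is processed independently, which is exactly what lets Spark scale the same computation across many machines instead of threads.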

Note: RDD, Dataset, and DataFrame are just abstractions over logical partitions of data. There are other concepts in RDD and DataFrame that I didn't cover in the example (i.e. resilience and immutability).

How can I split this into multiple partitions?

You can use the repartition API to split the data into more partitions:

spark.read().schema(Jsonreadystructure.SCHEMA)
        .json(JsonPath)
        .repartition(number);

And you can use the coalesce() API to bring the number of partitions down.
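The difference between the two can be sketched with the same plain-Java analogy (no Spark required; the names here are illustrative, not Spark's API): coalesce merges existing partitions together without redistributing individual elements, while repartition does a full shuffle, so every element may move and the result is evenly balanced.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class RepartitionDemo {

    // Rough analogy of coalesce(n): merge whole existing partitions
    // into n groups, without moving individual elements (no shuffle).
    static <T> List<List<T>> coalesce(List<List<T>> partitions, int n) {
        List<List<T>> out = new ArrayList<>();
        for (int i = 0; i < n; i++) out.add(new ArrayList<>());
        for (int i = 0; i < partitions.size(); i++) {
            out.get(i % n).addAll(partitions.get(i));
        }
        return out;
    }

    // Rough analogy of repartition(n): a full shuffle — flatten
    // everything and redistribute into n evenly balanced partitions.
    static <T> List<List<T>> repartition(List<List<T>> partitions, int n) {
        List<T> all = partitions.stream()
                .flatMap(List::stream)
                .collect(Collectors.toList());
        List<List<T>> out = new ArrayList<>();
        for (int i = 0; i < n; i++) out.add(new ArrayList<>());
        for (int i = 0; i < all.size(); i++) {
            out.get(i % n).add(all.get(i));
        }
        return out;
    }

    public static void main(String[] args) {
        // Start with 8 partitions of 10 elements each.
        List<List<Integer>> parts = new ArrayList<>();
        for (int i = 0; i < 8; i++) {
            final int base = i * 10;
            parts.add(IntStream.range(base, base + 10)
                    .boxed().collect(Collectors.toList()));
        }

        System.out.println(coalesce(parts, 4).size());     // 4
        System.out.println(repartition(parts, 16).size()); // 16
    }
}
```

Because real coalesce avoids a shuffle, it is cheaper than repartition when reducing the partition count, which is why Spark recommends it for that case.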