I'm trying to understand what a partition is in Spark.
My understanding is this: when we read from a source into a Dataset, that Dataset can be split into multiple sub-Datasets, those sub-Datasets are called partitions, and it is up to the Spark framework where and how they are distributed across the cluster. Is that correct?
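For example, in a small local test (the SparkSession, app name and input path below are just placeholders I made up for the experiment, not my real job) I can already see a single Dataset reporting several partitions:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class PartitionCountCheck {
    public static void main(String[] args) {
        // Local session just for experimenting
        SparkSession spark = SparkSession.builder()
                .appName("partition-count-check")
                .master("local[*]")
                .getOrCreate();

        // Read some JSON into a Dataset (the path is a placeholder for my real input)
        Dataset<Row> ds = spark.read().json("/tmp/sample-data/*.json");

        // The Dataset is internally split into partitions; this prints how many
        System.out.println("Number of partitions: " + ds.rdd().getNumPartitions());

        spark.stop();
    }
}

So my mental model was: one Dataset, split into N partitions that Spark places across the cluster.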
I started to doubt this when I read some online articles, which say:
Under the hood, these RDDs or Datasets are stored in partitions on different cluster nodes. Partition basically is a logical chunk of a large distributed data set.
This statement breaks my understanding. According to it, RDDs or Datasets sit inside partitions, but I was thinking of the RDD itself as a partition (after splitting).
Can anyone help me clear this doubt?
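For example, even a plain RDD that I build locally with an explicit number of slices reports that it contains several partitions, which is part of what confuses me. (This fragment continues from the local SparkSession in the test above and needs the org.apache.spark.api.java.JavaRDD, org.apache.spark.api.java.JavaSparkContext and java.util.Arrays imports.)

// Build one RDD, explicitly split into 4 partitions
JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());
JavaRDD<Integer> rdd = jsc.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8), 4);
System.out.println(rdd.getNumPartitions()); // prints 4, i.e. one RDD holding 4 partitions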
Here is my code snippet, where I am reading from JSON:
// Read the JSON input into a Dataset<Row> using a predefined schema
Dataset<Row> ds = spark.read()
        .schema(Jsonreadystructure.SCHEMA)
        .json(JsonPath);
So, how can I split this into multiple partitions while reading itself? Or is there another way to do it?
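For example, is explicitly repartitioning right after the read the intended approach, or does the number of input files/splits already decide this? (The repartition count of 8 below is just an arbitrary number for illustration, not something from my real job.)

// One option I was considering: explicitly repartition right after reading
Dataset<Row> repartitioned = ds.repartition(8);
System.out.println("Partitions after repartition: " + repartitioned.rdd().getNumPartitions());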