I'm new to the Spark world and struggling with some concepts.
How is parallelism achieved when using Spark Structured Streaming with a Kafka source?
Consider the following code snippet:
SparkSession spark = SparkSession
    .builder()
    .appName("myApp")
    .getOrCreate();

Dataset<VideoEventData> ds = spark
    .readStream()
    .format("kafka")
    ...

gDataset = ds.groupByKey(...);

pDataset = gDataset.mapGroupsWithState(
    ...
    /* process each key's values:
       loop over the values;
       if a value is valid, save the key/value result to HDFS */
    ...
);

StreamingQuery query = pDataset.writeStream()
    .outputMode("update")
    .format("console")
    .start();

// block until the streaming query terminates
query.awaitTermination();
I've read that parallelism is related to the number of data partitions, and that the number of partitions for a Dataset is based on the spark.sql.shuffle.partitions parameter.
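For reference, a minimal sketch of how I understand this parameter would be set (the value 5 is just the hypothetical figure used in my questions below):

SparkSession spark = SparkSession
    .builder()
    .appName("myApp")
    .config("spark.sql.shuffle.partitions", "5") // hypothetical value matching the example below
    .getOrCreate();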
For every batch (pull from Kafka), will the pulled items be divided among the number of spark.sql.shuffle.partitions? For example, with spark.sql.shuffle.partitions=5 and Batch1=100 rows, will we end up with 5 partitions of 20 rows each?

Considering the code snippet provided, do we still leverage Spark's parallelism, given the groupByKey followed by mapGroups/mapGroupsWithState?
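To observe this, I assume the partition count of each micro-batch could be printed with foreachBatch (available since Spark 2.4); this is just a sketch for illustration, not part of my actual job:

import org.apache.spark.api.java.function.VoidFunction2;
import org.apache.spark.sql.Dataset;

ds.writeStream()
    .foreachBatch((VoidFunction2<Dataset<VideoEventData>, Long>) (batch, batchId) -> {
        // rdd().getNumPartitions() reports how many partitions this micro-batch was split into
        System.out.println("Batch " + batchId + ": " + batch.rdd().getNumPartitions() + " partitions");
    })
    .start();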
UPDATE:
Inside gDataset.mapGroupsWithState is where I process each key's values and store the result in HDFS, so the output sink is being used only to print some stats to the console.
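For clarity, here is a minimal sketch of what that mapGroupsWithState call looks like; the State and Result bean classes and the isValid/saveToHdfs helpers are hypothetical placeholders for my actual logic:

import org.apache.spark.api.java.function.MapGroupsWithStateFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;

Dataset<Result> pDataset = gDataset.mapGroupsWithState(
    (MapGroupsWithStateFunction<String, VideoEventData, State, Result>) (key, values, state) -> {
        // iterate over all values of this key in the current micro-batch
        while (values.hasNext()) {
            VideoEventData value = values.next();
            if (isValid(value)) {        // hypothetical validity check
                saveToHdfs(key, value);  // hypothetical HDFS writer
            }
        }
        return new Result(key);          // hypothetical output bean
    },
    Encoders.bean(State.class),
    Encoders.bean(Result.class));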