1 vote

I'm confused about how many connections Spark would make to the database in the scenario below:

Let's say I have a Spark program running on only one worker node with one executor, and a DataFrame with 10 partitions. I want to write this DataFrame to Teradata. Since the level of parallelism is 10 but there is only one executor, will 10 connections be made while saving the data, or only 1?


2 Answers

1 vote

Since Spark 2.2, the numPartitions parameter specified for a JDBC data source is also used to control its writing behavior (in addition to its earlier purpose of setting the level of parallelism during reads). From the Spark docs:

numPartitions
The maximum number of partitions that can be used for parallelism in table reading and writing. This also determines the maximum number of concurrent JDBC connections. If the number of partitions to write exceeds this limit, we decrease it to this limit by calling coalesce(numPartitions) before writing.
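
For example, a write using this option might look like the minimal sketch below; the URL, table name, and credentials are placeholders, not values from the question. With numPartitions set to 2, the 10-partition DataFrame is coalesced to 2 partitions before the write, so at most 2 JDBC connections are opened concurrently. In your scenario, the single executor adds a further cap: each write task opens its own connection, but only as many tasks run at once as the executor has cores.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("TeradataWrite").getOrCreate()

    // A DataFrame with 10 partitions, mirroring the question's scenario.
    val df = spark.range(0, 1000).toDF("id").repartition(10)

    // numPartitions = 2 makes Spark call coalesce(2) before writing, so at
    // most 2 concurrent JDBC connections are opened, regardless of the 10
    // input partitions.
    df.write
      .format("jdbc")
      .option("url", "jdbc:teradata://td-host/database=mydb") // placeholder URL
      .option("dbtable", "target_table")                      // placeholder table
      .option("driver", "com.teradata.jdbc.TeraDriver")
      .option("user", "user")                                 // placeholder credentials
      .option("password", "password")
      .option("numPartitions", "2")
      .mode("append")
      .save()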

0 votes

It depends on your Spark -> Teradata solution.

In general, you will have one connection per core. Each core iterates over its own partitions one by one.

For example, if you use .foreach with a custom solution, you will have one connection at a time, handling one row at a time.

If you use foreachPartition with a custom solution, you will have one connection for many rows, as in the sketch below.
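
To illustrate, here is a minimal sketch of the foreachPartition approach, assuming a placeholder Teradata URL, placeholder credentials, and a hypothetical single-column target table. The function runs once per partition (that is, once per task), so one connection serves every row in that partition:

    import java.sql.DriverManager
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("ForeachPartitionWrite").getOrCreate()
    val df = spark.range(0, 1000).toDF("id").repartition(10)

    // Going through the RDD gives a plain Iterator[Row] per partition.
    df.rdd.foreachPartition { rows =>
      // Placeholder URL and credentials; use your real Teradata settings.
      val conn = DriverManager.getConnection(
        "jdbc:teradata://td-host/database=mydb", "user", "password")
      try {
        val stmt = conn.prepareStatement("INSERT INTO target_table (id) VALUES (?)")
        rows.foreach { row =>
          stmt.setLong(1, row.getLong(0))
          stmt.addBatch()
        }
        stmt.executeBatch() // many rows written over a single connection
      } finally {
        conn.close()
      }
    }

With a single one-core executor, the 10 partitions are processed sequentially, so the job opens 10 connections over its lifetime but only 1 is open at any moment.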