1 vote

I'm confused about how many connections Spark would make to the database in the scenario below:

Let's say I have a Spark program running on only one worker node with one executor, and a DataFrame with 10 partitions. I want to write this DataFrame to Teradata. Since the level of parallelism is 10 but there is only one executor, will 10 connections be made while saving the data, or only 1?


2 Answers

1 vote

Since Spark 2.2, the numPartitions parameter specified for a JDBC data source is also used to control its writing behavior (in addition to its earlier purpose of setting the level of parallelism during reads). From the Spark docs:

numPartitions
The maximum number of partitions that can be used for parallelism in table reading and writing. This also determines the maximum number of concurrent JDBC connections. If the number of partitions to write exceeds this limit, we decrease it to this limit by calling coalesce(numPartitions) before writing.
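
For example, a write using this option might look like the minimal sketch below; the URL, table name, and credentials are placeholders, not values from the question. With numPartitions set to 2, the 10-partition DataFrame is coalesced to 2 partitions before the write, so at most 2 JDBC connections are opened concurrently. In your scenario, the single executor adds a further cap: each write task opens its own connection, but only as many tasks run at once as the executor has cores.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("TeradataWrite").getOrCreate()

    // A DataFrame with 10 partitions, mirroring the question's scenario.
    val df = spark.range(0, 1000).toDF("id").repartition(10)

    // numPartitions = 2 makes Spark call coalesce(2) before writing, so at
    // most 2 concurrent JDBC connections are opened, regardless of the 10
    // input partitions.
    df.write
      .format("jdbc")
      .option("url", "jdbc:teradata://td-host/database=mydb") // placeholder URL
      .option("dbtable", "target_table")                      // placeholder table
      .option("driver", "com.teradata.jdbc.TeraDriver")
      .option("user", "user")                                 // placeholder credentials
      .option("password", "password")
      .option("numPartitions", "2")
      .mode("append")
      .save()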

0 votes

It depends on your Spark -> Teradata solution.

In general, you will have one connection per core. Each core iterates over its own partitions one by one.

For example, if you use .foreach with a custom solution, you will have one connection at a time, handling one row at a time.

If you use foreachPartition with a custom solution, you will have one connection for many rows, as in the sketch below.
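
To illustrate, here is a minimal sketch of the foreachPartition approach, assuming a placeholder Teradata URL, placeholder credentials, and a hypothetical single-column target table. The function runs once per partition (that is, once per task), so one connection serves every row in that partition:

    import java.sql.DriverManager
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("ForeachPartitionWrite").getOrCreate()
    val df = spark.range(0, 1000).toDF("id").repartition(10)

    // Going through the RDD gives a plain Iterator[Row] per partition.
    df.rdd.foreachPartition { rows =>
      // Placeholder URL and credentials; use your real Teradata settings.
      val conn = DriverManager.getConnection(
        "jdbc:teradata://td-host/database=mydb", "user", "password")
      try {
        val stmt = conn.prepareStatement("INSERT INTO target_table (id) VALUES (?)")
        rows.foreach { row =>
          stmt.setLong(1, row.getLong(0))
          stmt.addBatch()
        }
        stmt.executeBatch() // many rows written over a single connection
      } finally {
        conn.close()
      }
    }

With a single one-core executor, the 10 partitions are processed sequentially, so the job opens 10 connections over its lifetime but only 1 is open at any moment.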