I want to execute SQL query on a DB which is in Azure SQL managed instance using Azure Databricks. I have connected to DB using spark connector.
import com.microsoft.azure.sqldb.spark.config.Config
import com.microsoft.azure.sqldb.spark.connect._
val config = Config(Map(
"url" -> "mysqlserver.database.windows.net",
"databaseName" -> "MyDatabase",
"queryCustom" -> "SELECT TOP 100 * FROM dbo.Clients WHERE PostalCode = 98074" //Sql query
"user" -> "username",
"password" -> "*********",
))
//Read all data in table dbo.Clients
val collection = sqlContext.read.sqlDB(config)
collection.show()
I am using above method to fetch the data(Example from MSFT doc). Table sizes are over 10M in my case. My question is How does Databricks process the query here?
Below is the documentation: The Spark master node connects to databases in SQL Database or SQL Server and loads data from a specific table or using a specific SQL query. The Spark master node distributes data to worker nodes for transformation. The Worker node connects to databases that connect to SQL Database and SQL Server and writes data to the database. User can choose to use row-by-row insertion or bulk insert.
It says master node fetches the data and distributes the work to worker nodes later. In the above code, while fetching the data what if the query itself is complex and takes time? Does it spread the work to worker nodes? or I have to fetch the tables data first to Spark and then run the SQL query to get the result. Which method do you suggest?