2 votes

We are using Databricks and its SQL DW connector to load data into Synapse. I have a dataset with 10,000 rows and 40 columns, and it takes 7 minutes to load!

Loading the same dataset using Data Factory with PolyBase and the staging option takes 27 seconds. The same goes for bulk copy.

What could be wrong? Am I missing some configuration? Or is this business as usual?

Connection configuration:

df_insert.write \
    .format("com.databricks.spark.sqldw") \
    .option("url", sqlDwUrlSmall) \
    .option("dbtable", t_insert) \
    .option("forward_spark_azure_storage_credentials", "True") \
    .option("tempdir", tempDir) \
    .option("maxStrLength", maxStrLength) \
    .mode("append") \
    .save()
What DWU are you using? What resource class is associated with the user doing the load? – wBob
We are using the smallest DWU, DW100c. The Databricks user is in the largerc resource class, the Data Factory user in the default smallrc. I just noticed from the documentation that there is no difference between these resource classes at our DWU level; both get 25%. One additional note: reads from Synapse with Databricks are really fast. – Tero Kruth
100 is the lowest DWU and not really suitable for much. Try higher ones (400, 1000) to see if you get a different result. If not, come back. – wBob
I would like to point out that at the same database level I get totally different performance results depending on whether I use Data Factory or Databricks, so adjusting the database level is not the answer I'm looking for. – Tero Kruth
@tero: I am facing the exact same problem... if you found a solution, could you please post it as an answer? – sab

1 Answer

1 vote

You can try changing the write semantics (see the Databricks documentation).

Using the COPY write semantics, I was able to load data into Synapse faster.

You can configure it before running the write command, like this:

spark.conf.set("spark.databricks.sqldw.writeSemantics", "copy")
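
For reference, a minimal sketch of how this setting combines with the write call from the question (sqlDwUrlSmall, t_insert, tempDir, and maxStrLength are the question's own variables and are assumed to be defined in your notebook):

# Tell the connector to use the COPY statement instead of PolyBase for writes
# (session-level setting; it applies to writes issued after this point).
spark.conf.set("spark.databricks.sqldw.writeSemantics", "copy")

# Same append as in the question, now executed with COPY write semantics.
(df_insert.write
    .format("com.databricks.spark.sqldw")
    .option("url", sqlDwUrlSmall)
    .option("dbtable", t_insert)
    .option("forward_spark_azure_storage_credentials", "True")
    .option("tempdir", tempDir)
    .option("maxStrLength", maxStrLength)
    .mode("append")
    .save())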