Our organisation runs Databricks on Azure, used primarily by data scientists and analysts for ad-hoc analysis and exploration in notebooks.
We also run Kubernetes clusters for ETL workflows that don't require Spark.
We would like to use Delta Lake as our storage layer, with both Databricks and Kubernetes able to read and write as first-class citizens.
Currently our Kubernetes jobs write Parquet files directly to blob storage, and an additional job then spins up a Databricks cluster to load that Parquet data into Databricks' Delta table format. This is slow and expensive.
What I would like to do is write to Delta Lake directly from Python on Kubernetes, rather than first dumping a Parquet file to blob storage and then triggering an additional Databricks job to convert it to Delta format.
Conversely, I'd also like to query those Delta Lake tables from Kubernetes.
In short: how do I set up my Python environment on Kubernetes so that it has equal access to the existing Databricks Delta Lake for both writes and queries?
Code would be appreciated.