5
votes

Can one use Delta Lake without being dependent on the Databricks Runtime? (That is, is it possible to use Delta Lake with HDFS and Spark on-prem only?) If not, could you elaborate on why that is from a technical point of view?

4 Answers

6
votes

Yes, Delta Lake has been open sourced by Databricks (https://delta.io/). I am using Delta Lake (0.6.1) along with Apache Spark (2.4.5) and S3. Many other integrations are also available to accommodate an existing tech stack, e.g. Hive, Presto, Athena, etc.

Connectors: https://github.com/delta-io/connectors
Integrations: https://docs.delta.io/latest/presto-integration.html and https://docs.delta.io/latest/integrations.html
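
For illustration, here is a minimal PySpark sketch of that setup: open-source Delta Lake 0.6.1 on Apache Spark 2.4.x writing to S3, with no Databricks Runtime involved. The bucket name is a placeholder, and it assumes the usual hadoop-aws dependency and S3 credentials are already configured.

```python
from pyspark.sql import SparkSession

# Plain Apache Spark session with the open-source Delta Lake package added.
# Spark 2.4.x builds use Scala 2.11, hence the _2.11 artifact.
spark = (
    SparkSession.builder
    .appName("delta-on-s3")
    .config("spark.jars.packages", "io.delta:delta-core_2.11:0.6.1")
    .getOrCreate()
)

# Write a DataFrame as a Delta table directly to S3 (placeholder bucket).
spark.range(100).write.format("delta").save("s3a://my-bucket/tables/numbers")

# Read it back; the _delta_log next to the data gives a consistent snapshot.
spark.read.format("delta").load("s3a://my-bucket/tables/numbers").show()
```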

2
votes

According to this talk (https://vimeo.com/338100834), it is possible to use Delta Lake without the Databricks Runtime. Delta Lake is just a library that "knows" how to read and write a table (a collection of Parquet files) transactionally, by maintaining a special transaction log alongside each table. Of course, a dedicated connector is needed for external applications (e.g. Hive) to work with such tables; otherwise, the transactional and consistency guarantees cannot be enforced.
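
Here is a small sketch of what that looks like on disk, using a local path for simplicity (the path is a placeholder and assumed not to exist yet): after a couple of writes, the table directory holds ordinary Parquet files plus a `_delta_log` directory with one JSON commit file per transaction.

```python
import os
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("delta-log-demo")
    .config("spark.jars.packages", "io.delta:delta-core_2.11:0.6.1")
    .getOrCreate()
)

path = "/tmp/delta/events"  # local path, just for illustration

# Two separate writes = two commits in the transaction log.
spark.range(10).write.format("delta").save(path)
spark.range(10, 20).write.format("delta").mode("append").save(path)

# The log directory now contains 00000000000000000000.json,
# 00000000000000000001.json, ... one JSON file per committed transaction.
for name in sorted(os.listdir(os.path.join(path, "_delta_log"))):
    print(name)
```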

1
vote

According to the documentation (https://docs.delta.io/latest/quick-start.html#set-up-apache-spark-with-delta-lake), Delta Lake has been open-sourced for use with Apache Spark. The integration can be done easily by adding the Delta Lake JAR to your code (as shown in the sketch below) or by adding the library to the Spark installation path. Hive integration can be done using https://github.com/delta-io/connectors.
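
As a sketch of the first option, adding the package in code via `spark.jars.packages` (version numbers and the table path are examples only), the quick-start pattern on vanilla Spark also gives you features such as time travel:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("delta-quickstart")
    # Alternatively, drop the delta-core JAR into $SPARK_HOME/jars instead.
    .config("spark.jars.packages", "io.delta:delta-core_2.11:0.6.1")
    .getOrCreate()
)

path = "/tmp/delta-table"  # placeholder location

# Version 0 of the table, then an overwrite creating version 1.
spark.range(5).write.format("delta").save(path)
spark.range(5, 10).write.format("delta").mode("overwrite").save(path)

# Time travel: read the original snapshot back via the transaction log.
spark.read.format("delta").option("versionAsOf", 0).load(path).show()
```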

0
votes

Delta Lake is an open-source project that enables building a Lakehouse architecture on top of existing storage systems such as S3, ADLS, GCS, and HDFS.

You can find the GitHub repo for Delta here: https://github.com/delta-io/delta

In short, you can use Delta Lake without the Databricks Runtime since it is open source, but with Databricks you get it as a managed commercial offering with some optimisations that you don't get by default.