Delta Lake setup with Kubernetes

Question

Are there any online links or docs that can be used as a guide to set up Delta Lake (without Databricks Runtime) for use with Kubernetes?

Your question might be a little too broad… but basically you just need storage and to point Delta Lake at it. I would probably recommend storage outside of k8s, like some block storage, but it would depend on your infrastructure/cloud provider. — jgp
No, without any managed Kubernetes offerings. So basically a manually set up Kubernetes on EC2. — Deb

jgp · Accepted Answer · 2021-11-19T20:15:55

Data Mechanics, a company that is leading the K8s/Spark space, shares quite a bit of its know-how, for example: https://www.datamechanics.co/blog-post/setting-up-managing-monitoring-spark-on-kubernetes/.

The issue with K8s and Spark (and any data workload) is always the storage. K8s is great for compute but not as good for storage, and I think this is where you will have to spend more energy.

Based on what you are saying in the question, here are some of my thoughts. I have used Delta Lake in two scenarios: 1) creation & storage of a "final" zone and 2) intermediate storage in a pipeline.

If you are considering a final/gold zone, I would recommend S3, which gives you quasi-unlimited storage.
If you are storing intermediate results that you will reuse in another process, I would consider AWS EBS attached to your EC2 cluster (https://aws.amazon.com/ebs/). You can pick the performance (= $$$) level based on your SLA, choosing between SSD and HDD volumes. You can also provision IOPS for faster throughput.
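For the S3 option, here is a minimal sketch (not a definitive setup) of the spark-submit call you might build for open-source Delta Lake on a self-managed cluster. The two spark.sql.* settings are the ones the Delta Lake docs list for enabling Delta outside Databricks, and the spark.kubernetes.* keys come from the Spark-on-Kubernetes docs. Everything else is an assumption for illustration: the API server address, the image name, the package versions (they must match your Spark/Hadoop build, or you can bake the JARs into the image instead), and the job path.

```python
from typing import Dict, List


def delta_on_k8s_conf(image: str, executors: int = 2) -> Dict[str, str]:
    """Spark settings that enable open-source Delta Lake on Kubernetes."""
    return {
        # Session extension and catalog required by Delta Lake.
        "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
        "spark.sql.catalog.spark_catalog":
            "org.apache.spark.sql.delta.catalog.DeltaCatalog",
        # Spark-on-Kubernetes: image used for driver and executor pods.
        "spark.kubernetes.container.image": image,
        "spark.executor.instances": str(executors),
    }


def spark_submit_args(master: str, conf: Dict[str, str],
                      packages: List[str], app: str) -> List[str]:
    """Assemble the spark-submit argument list (nothing is executed here)."""
    args = ["spark-submit", "--master", master, "--deploy-mode", "cluster",
            "--packages", ",".join(packages)]
    for key in sorted(conf):  # sorted for a stable, readable command line
        args += ["--conf", f"{key}={conf[key]}"]
    args.append(app)
    return args


if __name__ == "__main__":
    # All values below are hypothetical placeholders.
    args = spark_submit_args(
        master="k8s://https://my-apiserver.example:6443",
        conf=delta_on_k8s_conf("my-registry.example/spark-delta:latest", 3),
        packages=["io.delta:delta-core_2.12:1.0.0",       # assumed version
                  "org.apache.hadoop:hadoop-aws:3.2.0"],  # assumed version
        app="local:///opt/app/job.py",
    )
    print(" ".join(args))
```

Inside the job itself, the gold table would then be written with something like df.write.format("delta").save("s3a://<your-bucket>/gold/<table>"), where the bucket and path are yours to choose.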

These recommendations may vary depending on volumes, throughput, SLA, etc. I hope they help nevertheless.
