0
votes

Are there any online links or docs that can be used as a guide to set up Delta Lake (without Databricks Runtime) for use with Kubernetes?

1
Your question might be a little too broad… but basically you just need storage and to point Delta Lake at it. I would probably recommend storage outside of k8s, like some block storage, but it would depend on your infrastructure/cloud provider. - jgp
I intend to use AWS S3. - Deb
And EKS for Kube? - jgp
No, without any managed Kubernetes offering. So basically a manually set up Kubernetes cluster on EC2. - Deb

1 Answer

0
votes

Data Mechanics is a company that is leading the K8S/Spark space, and they share quite a bit of their knowledge, for example: https://www.datamechanics.co/blog-post/setting-up-managing-monitoring-spark-on-kubernetes/.

The issue with K8S and Spark (and any data workload) is always the storage. K8S is great for compute but not as good for storage, and I think this is where you will have to spend more energy.

Based on what you said in the question and comments, here are some of my thoughts. I have used Delta Lake in two scenarios: 1) creation and storage of a "final" zone and 2) intermediate storage in a pipeline.

  1. If you are considering a final/gold zone, I would recommend S3, which gives you virtually unlimited storage.

  2. If you are storing intermediate results that you will reuse in another process, I would consider AWS EBS volumes attached to your EC2 cluster (https://aws.amazon.com/ebs/). You can pick the performance (= $$$) level based on your SLA, choosing between SSD and HDD volumes. You can also provision IOPS if you need higher performance.
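To make the two options a bit more concrete, here is a minimal sketch (not a tested production setup) that assembles a `spark-submit` command for a self-managed Kubernetes cluster, with Delta Lake enabled and an EBS-backed PersistentVolumeClaim mounted on the executors for intermediate data. The helper function names, the API server URL, the container image, the claim name, the mount path, and the package coordinates are all assumptions for illustration; the package versions are left as placeholders because they must match your Spark and Hadoop builds. S3 credentials are also left out, since how you provide them depends on your environment.

```python
from typing import Dict, List


def delta_conf() -> Dict[str, str]:
    """Spark settings that register Delta Lake's SQL extension and catalog."""
    return {
        "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
        "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    }


def ebs_pvc_conf(volume_name: str, claim_name: str, mount_path: str) -> Dict[str, str]:
    """Mount an existing PersistentVolumeClaim (e.g. EBS-backed) on each executor."""
    prefix = f"spark.kubernetes.executor.volumes.persistentVolumeClaim.{volume_name}"
    return {
        f"{prefix}.options.claimName": claim_name,
        f"{prefix}.mount.path": mount_path,
    }


def spark_submit_args(
    api_server: str,
    image: str,
    packages: List[str],
    app: str,
    conf: Dict[str, str],
) -> List[str]:
    """Build the spark-submit argument list for cluster mode on Kubernetes."""
    args = [
        "spark-submit",
        "--master", f"k8s://{api_server}",
        "--deploy-mode", "cluster",
        "--packages", ",".join(packages),
        "--conf", f"spark.kubernetes.container.image={image}",
    ]
    # Sort keys so the generated command is stable between runs.
    for key in sorted(conf):
        args += ["--conf", f"{key}={conf[key]}"]
    args.append(app)
    return args


if __name__ == "__main__":
    # Every value below is a placeholder (assumption), not a recommendation.
    conf = {
        **delta_conf(),
        **ebs_pvc_conf("scratch", "spark-scratch-pvc", "/mnt/scratch"),
    }
    cmd = spark_submit_args(
        api_server="https://k8s.example.internal:6443",
        image="example/spark-delta:latest",
        packages=[
            "io.delta:delta-core_2.12:<delta-version>",
            "org.apache.hadoop:hadoop-aws:<hadoop-version>",
        ],
        app="local:///opt/app/job.py",
        conf=conf,
    )
    print(" \\\n  ".join(cmd))
```

Inside the job itself, the gold zone would then be written to an `s3a://` path and the intermediate tables to a path under the mounted volume. One caveat to check in the Delta Lake storage documentation: with S3, concurrent writes to the same table are by default only safe when they all come from a single Spark driver.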

These recommendations may vary depending on your data volumes, throughput, SLA, etc. I hope they help nevertheless.