Are there any online links or docs that can be used as a guide to set up Delta Lake (without Databricks Runtime) for use with Kubernetes?
1 Answer
Data Mechanics, a company that is leading the K8S/Spark space, shares quite a bit of its know-how, for example in this guide: https://www.datamechanics.co/blog-post/setting-up-managing-monitoring-spark-on-kubernetes/
The issue with K8S and Spark (and any data workload) is always the storage. K8S is great for compute but not as good for storage, and I think this is where you will have to spend more energy.
Based on what you describe in the question, here are some of my thoughts. I have used Delta Lake in two scenarios: 1) creating and storing a "final" zone, and 2) intermediate storage in a pipeline.
If you are considering a final/gold zone, I would recommend S3, since it gives you quasi-unlimited storage.
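To make the S3 option concrete, here is a minimal sketch of the Spark settings that open-source Delta Lake typically needs when the job is submitted to Kubernetes and the table lives on S3. The two `spark.sql.*` settings, the S3A filesystem class and `spark.kubernetes.container.image` are standard Spark/Delta/Hadoop options. Everything else is an assumption for illustration: the helper names, the API server address, the image name, the package versions (they must match your Spark and Hadoop build, and newer Delta releases publish the artifact under a different name) and the application path.

```python
from typing import Dict, List


def delta_on_s3_conf(image: str) -> Dict[str, str]:
    """Spark settings for open-source Delta Lake with tables stored on S3."""
    return {
        # Enable Delta Lake's SQL extension and catalog (no Databricks Runtime needed).
        "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
        "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog",
        # Route s3a:// paths through the Hadoop S3A connector.
        "spark.hadoop.fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
        # Image that Spark on Kubernetes uses for the driver and executor pods.
        "spark.kubernetes.container.image": image,
    }


def spark_submit_args(k8s_api: str, image: str, packages: List[str], app: str) -> List[str]:
    """Assemble a spark-submit command line (as a list) targeting a K8S cluster."""
    args = [
        "spark-submit",
        "--master", f"k8s://{k8s_api}",
        "--deploy-mode", "cluster",
        "--packages", ",".join(packages),
    ]
    for key, value in delta_on_s3_conf(image).items():
        args += ["--conf", f"{key}={value}"]
    args.append(app)
    return args


if __name__ == "__main__":
    # All values below are hypothetical placeholders.
    command = spark_submit_args(
        k8s_api="https://kubernetes.example.com:6443",
        image="example.registry/spark-py:3.1.1",
        packages=["io.delta:delta-core_2.12:1.0.0", "org.apache.hadoop:hadoop-aws:3.2.0"],
        app="local:///opt/spark/work-dir/job.py",
    )
    print(" ".join(command))
```

Inside the job itself you would then read and write with `format("delta")` against an `s3a://` path. Instead of `--packages`, the jars can also be baked into the container image, which avoids downloading them at submit time.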
If you are storing intermediate results that you will reuse in another process, I would consider AWS EBS volumes attached to your EC2 cluster (https://aws.amazon.com/ebs/). You can pick the performance (= $$$) level based on your SLA, choosing between SSD and HDD volumes, and you can provision IOPS for faster throughput.
These recommendations may vary depending on volumes, throughput, SLA, etc. I hope they help nevertheless.