20
votes

I have several questions about Spark + Delta.

1) Databricks proposes three layers (bronze, silver, gold). Which layer is recommended for machine learning, and why? I assume the idea is to have the data clean and ready in the gold layer.

2) If we abstract the concepts behind these three layers, can we think of the bronze layer as a data lake, the silver layer as databases, and the gold layer as a data warehouse? I mean in terms of functionality.

3) Is Delta architecture a commercial term, an evolution of the Kappa architecture, or a new architecture in the same vein as Lambda and Kappa? What are the differences between (Delta + Lambda architecture) and Kappa architecture?

4) In many cases Delta + Spark scales much further than most databases, usually at a much lower cost, and if we tune things right we can get almost 2x faster query results. I know it is pretty complicated to compare the current popular data warehouses with a Feature/Agg Data Store, but I would like to know how I can make this comparison.

5) I used to use Kafka, Kinesis, or Event Hubs for stream processing. What kinds of problems can happen if we replace these tools with a Delta Lake table? (I know it depends on many things, but I would like a general picture.)


2 Answers

16
votes

1) Leave it up to your data scientists. They should be comfortable working in the silver and gold regions. Some more advanced data scientists will want to go back to the raw data and parse out additional information that may not have been included in the silver/gold tables.

2) Bronze = raw data in native format/Delta Lake format. Silver = sanitized and cleaned data in Delta Lake. Gold = data that is accessed via the Delta Lake or pushed to a data warehouse, depending on business requirements.

3) Delta architecture is an easier version of the Lambda architecture. It is a commercial term at this point; we'll see if that changes in the future.

4) Delta Lake + Spark is the most scalable data storage mechanism with a reasonable price. You're welcome to test the performance based on your business requirements. Delta Lake will be far cheaper than any data warehouse for storage. Your requirements around data access and latency will be the larger question.

5) Kafka, Kinesis, or Event Hubs are sources for getting data from the edge to the data lake. Delta Lake can act as both a source and a sink for a streaming application. There are actually very few problems with using Delta as a source. The Delta Lake source lives on blob storage, so we get around many of the infrastructure issues but take on the consistency issues of blob storage. Delta Lake as a source for streaming jobs is far more scalable than Kafka/Kinesis/Event Hubs, but you still need those tools to get data from the edge into the Delta Lake.
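
To make point 5 more concrete, here is a rough sketch of why a Delta table can serve as a streaming source: writers append commits, and a reader picks up only the commits added since the version it last processed. This is a toy model in plain Python, not the Delta Lake API; `ToyDeltaTable` and `ToyStreamReader` are hypothetical names invented for illustration. In a real Spark job the read side would typically be `spark.readStream.format("delta")`.

```python
class ToyDeltaTable:
    """Toy stand-in for a Delta table: an append-only list of commits."""

    def __init__(self):
        self.commits = []  # commit i holds the rows added in version i

    def append(self, rows):
        """Record a new commit and return its version number."""
        self.commits.append(list(rows))
        return len(self.commits) - 1

    def read_since(self, version):
        """Return (rows added after `version`, latest version)."""
        new_rows = [row for commit in self.commits[version + 1:] for row in commit]
        return new_rows, len(self.commits) - 1


class ToyStreamReader:
    """Incremental reader that checkpoints the last version it processed."""

    def __init__(self, table):
        self.table = table
        self.last_version = -1  # nothing processed yet

    def poll(self):
        """Return only the rows committed since the previous poll."""
        rows, latest = self.table.read_since(self.last_version)
        self.last_version = latest
        return rows


table = ToyDeltaTable()
reader = ToyStreamReader(table)

table.append(["e1", "e2"])    # e.g. written by a job that consumes Kafka
first_batch = reader.poll()   # picks up both events
table.append(["e3"])
second_batch = reader.poll()  # picks up only the new event
third_batch = reader.poll()   # nothing new since the last poll
```

Note that the events still have to reach the table somehow, which is the role the answer assigns to Kafka, Kinesis, or Event Hubs.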

2
votes
  1. The medallion tables are a recommendation based on how our customers are using Delta Lake. You do not have to follow it exactly; however, it does align nicely with how people design EDWs. As for which table to use for machine learning, that is going to be a choice for the folks doing the machine learning. Some may want to access the Bronze tables because that is the raw data; nothing has been done to it. Others may want the Silver tables because they are presumed to be clean, albeit augmented. Usually the Gold tables are highly refined and specific to answering well-defined business questions.

  2. Not exactly. The Bronze tables are the raw event data, e.g. one row per event or measurement. The Silver tables are also at the event/measurement level, but they are highly refined and ready for queries, reporting, dashboards, etc. The Gold tables can be fact and dimension tables, aggregate tables, or curated data sets. It is important to remember that Delta is not meant to be used as a transactional OLTP system. It is really meant for OLAP workloads.

  3. Delta architecture is the name we gave to a particular implementation of Delta Lake. It is not a commercial term per se, but hopefully it becomes one. There is enough information out there to compare and contrast the Kappa and Lambda architectures. The Delta architecture is well defined throughout the Delta documentation and Databricks blogs, tech talks, YouTube videos, etc.

  4. I would ask what exactly you want to compare: speed, features, products, ...?

  5. Delta Lake is not trying to replace any messaging or pub/sub system; they have different use cases. Delta Lake can connect to each of the products you mention as both a subscriber and a publisher. Don't forget that Delta Lake is an open storage layer that brings ACID-compliant transactions, high performance, and high reliability to data lakes.
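
The layer descriptions in point 2 can be sketched roughly as follows. This is plain Python over a handful of made-up sensor readings, only to show the shape of each layer; the field names and the `to_silver` / `to_gold` helpers are assumptions for illustration, and a real pipeline would do the same steps with Spark DataFrames written to Delta tables.

```python
from collections import defaultdict

# Hypothetical raw events, one row per measurement (the Bronze level).
bronze = [
    {"device": "a", "temp": "21.5"},
    {"device": "a", "temp": "bad"},   # malformed reading
    {"device": "b", "temp": "19.0"},
    {"device": "a", "temp": "22.5"},
]


def to_silver(rows):
    """Keep event-level rows, but drop malformed ones and fix types."""
    silver_rows = []
    for row in rows:
        try:
            silver_rows.append({"device": row["device"], "temp": float(row["temp"])})
        except ValueError:
            continue  # skip readings that cannot be parsed as numbers
    return silver_rows


def to_gold(rows):
    """Aggregate event-level rows into one average temperature per device."""
    temps = defaultdict(list)
    for row in rows:
        temps[row["device"]].append(row["temp"])
    return {device: sum(vals) / len(vals) for device, vals in temps.items()}


silver = to_silver(bronze)  # still one row per event, now cleaned and typed
gold = to_gold(silver)      # an aggregate table answering a specific question
print(gold)
```

Someone doing machine learning could start from any of the three: `bronze` keeps the malformed reading, `silver` keeps every valid event, and `gold` only keeps the per-device average.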

Louis.