Data Governance solution for Databricks, Synapse and ADLS gen2

Question

I'm new to data governance, so forgive me if the question lacks some information.

Objective

We're building a data lake and an enterprise data warehouse from scratch for a mid-size telecom company on the Azure platform. We're using ADLS Gen2, Databricks and Synapse for our ETL processing, data science, ML and QA activities.

We already have about a hundred input tables and ingest about 25 TB per year. We expect more in the future.

The business has a strong preference for cloud-agnostic solutions. Still, they are okay with Databricks since it's available on both AWS and Azure.

Question

What is the best Data Governance solution for our stack and requirements?

My workarounds

I haven't used any data governance solutions yet. I like the AWS Data Lake solution, since it provides basic functionality out of the box. AFAIK, Azure Data Catalog is outdated, because it doesn't support ADLS Gen2.

After some very quick googling I found three options:

Databricks Privacera
Databricks Immuta
Apache Ranger & Apache Atlas.

Currently I'm not even sure whether the 3rd option fully supports our Azure stack. Moreover, it would require a much bigger development (infrastructure definition) effort. So are there any reasons I should look in the Ranger/Atlas direction?

What are the reasons to prefer Privacera over Immuta and vice versa?

Are there any other options I should evaluate?

What is already done

From a data governance perspective, we have done only the following:

Defined data zones inside ADLS
Applied encryption/obfuscation for sensitive data (due to GDPR requirements)
Implemented Row-Level Security (RLS) at the Synapse and Power BI layers
Built a custom audit framework that logs what was persisted and when

Things to be done

Data lineage and a single source of truth. Just 4 months after the start, understanding the dependencies between data sets has already become a pain point. The lineage information is stored in Confluence, where it's hard to maintain and has to be updated continuously in multiple places. It's already outdated in some places.
Security. Business users may do some data exploration in Databricks notebooks in the future. We need RLS for Databricks.
Data Life Cycle management.
Possibly other data governance topics, such as data quality.
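
To make the lineage item concrete, here is a minimal sketch of the kind of query that is hard to answer from wiki pages but trivial once dependencies are kept in machine-readable form. The dataset names and the DEPENDENCIES map are hypothetical, invented for illustration only. This is not a governance tool and does not use the API of any product mentioned above.

```python
from collections import defaultdict, deque

# Hypothetical dataset names, for illustration only.
# Each key is a dataset; the value lists the datasets it is built from.
DEPENDENCIES = {
    "curated.customers": ["raw.crm_customers"],
    "curated.usage": ["raw.cdr_events", "curated.customers"],
    "dwh.monthly_revenue": ["curated.usage", "raw.billing"],
}


def upstream(dataset, deps):
    """Return every dataset that `dataset` depends on, directly or transitively."""
    seen = set()
    queue = deque(deps.get(dataset, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue
        seen.add(current)
        queue.extend(deps.get(current, []))
    return seen


def downstream(dataset, deps):
    """Return every dataset that would be affected by a change to `dataset`."""
    # Invert the edges, then reuse the same breadth-first traversal.
    reverse = defaultdict(list)
    for target, sources in deps.items():
        for source in sources:
            reverse[source].append(target)
    return upstream(dataset, reverse)


if __name__ == "__main__":
    print(sorted(upstream("dwh.monthly_revenue", DEPENDENCIES)))
    print(sorted(downstream("raw.crm_customers", DEPENDENCIES)))
```

A catalog such as Apache Atlas stores this kind of graph centrally; the sketch only shows the upstream and impact-analysis questions such a tool is expected to answer.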

Sumit Sarkar · Accepted Answer · 2020-05-12T22:16:53

To better understand option #2 that you cited for data governance on Azure, here is a how-to tutorial demonstrating the experience of applying RLS on Databricks; a related Databricks video demo; and other data governance tutorials.

Full disclosure: My team produces content for data engineers at Immuta and I hope this helps save you some time in your research.