
I am connecting to a Delta table in an Azure Data Lake Storage Gen2 account by mounting it in Databricks and creating a table (USING DELTA). I am then connecting to this table in Power BI using the Databricks connector.
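Roughly what I have is something along these lines (the mount point, path, and table name here are placeholders rather than my real ones, and the container is mounted beforehand with dbutils.fs.mount):

CREATE TABLE IF NOT EXISTS mytable
USING DELTA
LOCATION '/mnt/datalake/path/to/delta-table';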

Firstly, I am unclear as to the relationship between the data lake and the Spark table in Databricks. Is it correct that the Spark table retrieves the latest snapshot from the data lake (delta lake) every time it is itself queried? Is it also the case that it is not possible to effect changes in the data lake via operations on the Spark table?

Secondly, what is the best way to reduce the columns in the Spark table (ideally before it is read into Power BI)? I have tried creating the Spark table with a specified subset of columns, but I get a "cannot change schema" error. Instead, I can create another Spark table that selects from the first one, but this seems pretty inefficient and (I think) would need to be recreated frequently in line with the refresh schedule of the Power BI report. I don't know whether it's possible to have a Spark Delta table that references another Spark Delta table, so that the former also always reflects the latest snapshot when queried?

As you can tell, my understanding of this is limited (as is the documentation!) but any pointers very much appreciated.

Thanks in advance and for reading!


1 Answer


A table in Spark is just metadata that specifies where the data is located. When you query the table, Spark simply looks up in the metastore where the data is stored, what its schema is, etc., and then reads that data. Changes made in ADLS will therefore also be reflected in the table. It is also possible to modify the data from those tools, but that depends on the access rights available to the Spark cluster that processes the data - you can set permissions either at the ADLS level, or using table access control.
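You can check this yourself: for a Delta table, DESCRIBE DETAIL (or DESCRIBE EXTENDED) shows what the metastore holds, including the underlying location in the lake. The table name below is just a placeholder:

-- shows format, location, size, etc. for the Delta table;
-- the location should point back at the mounted ADLS path
DESCRIBE DETAIL mytable;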

For the second part - you just need to create a view over the original table that selects only the limited set of columns. The data is not copied, and the latest updates in the original table will always be available for querying. Something like:

CREATE OR REPLACE VIEW myview
    AS SELECT col1, col2 FROM mytable
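If table access control is enabled on the cluster, you could then expose only the view to the account that Power BI connects with - something like the following, where the principal name is just a placeholder:

GRANT SELECT ON VIEW myview TO `powerbi_user@example.com`;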

P.S. If you're only accessing the data via Power BI or other BI tools, you may want to look at Databricks SQL (once it is in public preview), which is heavily optimized for BI use cases.