Databricks: global unmanaged table, partition metadata sync guarantees

Question

Objective

I want to create Databricks global unmanaged tables from ADLS data and use them from multiple clusters (automated and interactive). To do this, I first run CREATE TABLE my_table ... and then MSCK REPAIR TABLE my_table. I'm using the Databricks internal Hive metastore.

The issue

Sometimes the result of MSCK REPAIR was not synced across clusters at all, for hours: cluster #1 saw the partitions immediately, while cluster #2 saw no data for some time.

Sometimes it does sync, but I can't understand why it fails in the other cases.

Question

Does Databricks use a separate internal Hive metastore per cluster? If so, are there any guarantees about sync-up between clusters?

Andrew Corson · Accepted Answer · 2020-08-28T18:34:32

I believe each Databricks deployment has a single Hive metastore: https://docs.databricks.com/data/metastores/index.html

So if the metastore is being updated immediately, then the next most likely problem is that the old table metadata is being cached, so you aren't seeing the updates. Have you tried running

REFRESH TABLE <database>.<table>;

on the cluster that was having the sync issues?
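If you want to script that check from a notebook attached to the affected cluster, here is a minimal sketch. It assumes the `spark` session object that Databricks notebooks predefine; the helper names (`refresh_and_check_statements`, `run_on_cluster`) and the `my_db` database name are hypothetical, made up for illustration. It only builds and runs two standard Spark SQL statements, REFRESH TABLE and SHOW PARTITIONS.

```python
from typing import Any, List


def refresh_and_check_statements(database: str, table: str) -> List[str]:
    """Build the SQL to drop cached metadata for a table and list its partitions.

    Hypothetical helper: the names are illustrative, not part of any Databricks API.
    """
    name = f"`{database}`.`{table}`"
    return [
        f"REFRESH TABLE {name}",    # invalidate this cluster's cached metadata for the table
        f"SHOW PARTITIONS {name}",  # list the partitions the metastore currently reports
    ]


def run_on_cluster(session: Any, statements: List[str]) -> List[Any]:
    """Run each statement through session.sql() and collect the results.

    `session` is assumed to be a SparkSession (for example the `spark`
    object in a Databricks notebook); anything with a .sql(str) method works.
    """
    return [session.sql(stmt) for stmt in statements]


# Usage in a notebook on the cluster that does not see the partitions
# (assumes `spark` exists and the table lives in a database called my_db):
#   results = run_on_cluster(spark, refresh_and_check_statements("my_db", "my_table"))
#   results[-1].show()  # partitions visible after the refresh
```

If the partitions show up after the refresh, stale cached metadata on that cluster was the likely cause; if they still do not, the problem is probably elsewhere.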
