1
votes

I'm using Azure HDInsights, Azure Data Lake and Hive via Ambari.

I'm setting up a test environment. The original environment's data is stored on Azure Data Lake, in the form of ORC files loaded via Hive. I copied all the data from the original Data Lake to the test Data Lake via Data Factory successfully.

When I try to create my Hive ORC tables in the test environment and then query them no records are returned. Schema/Folder locations on the respective data lakes are the same, am I missing something related to the metastore since it's a different one on test?

Edit: I want to add that I set up an external table to the Test environment's Data Lake in SQL Datawarehouse using Polybase and that is able to read the data just fine.

1
If your tables are partitioned, you need to run MSCK REPAIR TABLE, in order to make Hive to discover manually copied partitions.chemikadze

1 Answers

0
votes

As chemikadze mentioned, running MSCK REPAIR TABLE <your-table> fixed it. My tables were partitioned and so the metastore didn't know to look in certain sub-folders for locating the data.

The following pattern now helps me accomplish an environment duplication:

  1. Create Data Factory Pipeline to copy Data Lake folders from Dev -> Test.
  2. Run Hive DDL on Test environment.
  3. Run repair table command on each of the partitioned tables created in Test environment.