2
votes

I understand the difference between internal tables and external tables in Hive as follows:

1) If we drop an internal table, both the data files and the metadata are deleted; if we drop an external table, only the metadata is deleted.
2) If the data files need to be shared with other tools/applications, we use an external table, so that the data is still available to them after the table is dropped; otherwise we use an internal table.

I have gone through the answers to the question "Difference between Hive internal tables and external tables?", but I am still not clear about the proper use cases for internal tables. So my question is: why would I need to create an internal table at all? Why can't I make everything an external table?

2
Use internal tables for temporary tables and external tables for everything else. You don't want to have to delete HDFS files manually. - Mike Gan

2 Answers

2
votes

Use EXTERNAL tables when:

  • The data is also used outside of Hive. For example, the data files are read and processed by an existing program that doesn't lock the files.

  • The data is permanent, i.e. it must remain available whenever it is needed.

Use INTERNAL tables when:

  • The data is temporary.

  • You want Hive to completely manage the lifecycle of the table and its data.
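As a rough illustration of the two cases, here is a minimal HiveQL sketch. The table names, columns, and the `/data/shared/events` path are hypothetical and not taken from the question, and the exact drop behaviour can depend on your Hive version and configuration.

```sql
-- Internal (managed) table: Hive owns both the metadata and the data files,
-- which are stored under Hive's warehouse directory.
-- Table and column names here are made up for illustration.
CREATE TABLE staging_events (
  event_id BIGINT,
  payload  STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- External table: Hive records only the metadata; the files stay at a
-- location you control (hypothetical path) and can be read by other tools.
CREATE EXTERNAL TABLE shared_events (
  event_id BIGINT,
  payload  STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/shared/events';

-- Removes the metadata and the data files
-- (the files go to the HDFS trash first if trash is enabled).
DROP TABLE staging_events;

-- Removes the metadata only; the files under /data/shared/events remain.
DROP TABLE shared_events;
```

With the internal table there is nothing left to clean up by hand after the `DROP`, which is why it suits temporary or intermediate data.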

1
votes

Let's understand it with two simple scenarios:

  • Suppose you have a data set and you have to run several analytics tasks on it. Because of the nature of those tasks, some of them can be done with HiveQL, some need Pig Latin, and some need MapReduce, etc. This is where an external table comes into the picture: the same data set can be used for all of the analytics, instead of keeping a separate copy of the data for each tool. Here Hive doesn't need ownership of the data set, because several tools are going to use it.

  • There can also be a scenario where all of the analytics tasks can be solved with HiveQL alone. This is where an internal table comes into the picture: you can put the entire data set into Hive's warehouse, and Hive has complete ownership of it.
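The second scenario might look like the HiveQL sketch below. The `sales_managed` table and the `/landing/sales` path are hypothetical. The last statement is an assumption about what you could do if requirements change later; table names, property handling, and managed-table behaviour can vary by Hive version and distribution.

```sql
-- Hive-only data: create an internal (managed) table.
-- Names and paths are made up for illustration.
CREATE TABLE sales_managed (
  order_id BIGINT,
  amount   DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- Without the LOCAL keyword, LOAD DATA INPATH moves the files from the
-- given HDFS path into the table's directory in Hive's warehouse,
-- so Hive now owns the data.
LOAD DATA INPATH '/landing/sales' INTO TABLE sales_managed;

-- If another tool later needs the same files, the table can be switched
-- to external so that a future DROP TABLE leaves the files in place.
ALTER TABLE sales_managed SET TBLPROPERTIES ('EXTERNAL'='TRUE');
```

So choosing an internal table up front does not lock you in; it just makes Hive responsible for the data's lifecycle until you decide otherwise.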