4 votes

I have been going through some Hive books and tutorials. One of the books, Hadoop in Practice, says:

When you create an external (unmanaged) table, Hive keeps the data in the directory specified by the LOCATION keyword intact. But if you were to execute the same CREATE command and drop the EXTERNAL keyword, the table would be a managed table, and Hive would move the contents of the LOCATION directory into /user/hive/warehouse/stocks, which may not be the behavior you expect.

I created a managed table with the LOCATION keyword and then loaded data into the table from an HDFS file. But I could not see any directory created under /user/hive/warehouse; instead, the new directory was created at the LOCATION I specified. So I think that if I create a MANAGED table with a LOCATION specified, nothing is created in the Hive warehouse directory? Is this understanding correct?

Also, if the input file for the LOAD command is already on HDFS, then both internal and external tables will move the data into their own location. Is this understanding also correct?
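
For illustration, a minimal sketch of what I mean (the table name, columns, and paths are made up):

CREATE TABLE stocks (
  symbol STRING,
  price  FLOAT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/me/stocks_data';        -- managed table, but with an explicit location

LOAD DATA INPATH '/user/me/input/stocks.csv' INTO TABLE stocks;   -- input file already on HDFS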


4 Answers

2 votes

In both cases (managed or external) LOCATION is optional, so whenever you specify LOCATION the data will be stored at that same HDFS path irrespective of which kind of table you are creating (managed or external). And if you don't use LOCATION, the default warehouse path configured in hive-site.xml is used.
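
A minimal sketch of the three cases (table names and paths are placeholders):

-- managed table with an explicit location: files go under /data/managed_t, not the warehouse directory
CREATE TABLE managed_t (id INT) LOCATION '/data/managed_t';

-- external table with an explicit location: files go under /data/external_t
CREATE EXTERNAL TABLE external_t (id INT) LOCATION '/data/external_t';

-- no LOCATION: Hive uses hive.metastore.warehouse.dir from hive-site.xml (typically /user/hive/warehouse)
CREATE TABLE default_t (id INT);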

2 votes

First of all, when you create a managed table with the LOCATION keyword, Hive does not create a directory at the specified location; instead it will give you an exception: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:hdfs://path/of/the/given/location is not a directory or unable to create one).
This means that the location you give in the DDL must already exist as a directory, otherwise the exception above is thrown.
Once the directory exists, you can create the table with the LOCATION clause.
You can then run a SELECT * FROM <table> to view the data (without having to load it).
But when you drop this table, your data is also removed from HDFS (unlike with external tables), and the metadata is gone as well.
This is the key point about a managed table created with the LOCATION keyword: it behaves partly like an external table and partly like a managed table.
External, in that you don't have to load the data; you just specify the location.
Managed, in that when you drop the table, the data is deleted too.
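
A rough sketch of that sequence (the table name and directory are made up; the directory is assumed to already exist on HDFS):

CREATE TABLE sales (id INT, amount DOUBLE)
LOCATION '/data/existing/sales';   -- directory must already exist, per the note above

SELECT * FROM sales;               -- reads whatever files already sit in that directory

DROP TABLE sales;                  -- removes the metadata AND the files under /data/existing/sales
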
Hope that makes sense.

0 votes

When you create a table with the LOCATION keyword, the table points at that location; LOCATION specifies the HDFS path to the data files.

CREATE EXTERNAL TABLE IF NOT EXISTS mydb.contacts (
  name         STRING,
  -- ... other columns
  city         STRING
)
LOCATION '/user/hive/warehouse/mydb.db/contacts';

Whenever you specify a location, you have to make sure that you place the data files there. In the above example we are explicitly telling Hive where the data for the external table lives. If we didn't specify a LOCATION, the default would look like the path below, and this is true for any table created without a LOCATION clause, unless your sysadmin changes the default of course.

/user/hive/warehouse/databasename.db/contacts
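
For example, assuming the contacts table above has been created, you can check which directory Hive is pointing at and confirm that files dropped there are picked up (database and table names are just illustrative):

DESCRIBE FORMATTED mydb.contacts;
-- the output includes a "Location:" field with the HDFS directory the table points at;
-- files placed in that directory are picked up by queries such as:
SELECT * FROM mydb.contacts LIMIT 10;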

0 votes

Managed and unmanaged tables

Every Spark SQL table has metadata information that stores the schema and the data itself.

A managed table is a Spark SQL table for which Spark manages both the data and the metadata. In the case of a managed table, Databricks stores the metadata and data in DBFS in your account. Since Spark SQL manages the tables, running DROP TABLE example_data deletes both the metadata and the data.

Some common ways of creating a managed table are:

CREATE TABLE <example-table>(id STRING, value STRING)

Another option is to let Spark SQL manage the metadata while you control the data location. We refer to this as an unmanaged table. Spark SQL manages only the relevant metadata, so when you perform DROP TABLE, Spark removes only the metadata and not the data itself. The data is still present in the path you provided.

You can create an unmanaged table with your data in data sources such as Cassandra, a JDBC table, and so on. See Data sources for more information about the data sources supported by Databricks. Some common ways of creating an unmanaged table are:

CREATE TABLE <example-table>(id STRING, value STRING) USING org.apache.spark.sql.parquet OPTIONS (PATH "<your-storage-path>")
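
As a hedged sketch of the difference in DROP behavior (the table name and storage path are the same placeholders as above):

-- dropping the unmanaged table removes only the metadata in the metastore
DROP TABLE <example-table>

-- the Parquet files at <your-storage-path> are untouched, so the table can be re-created over them
CREATE TABLE <example-table>(id STRING, value STRING) USING org.apache.spark.sql.parquet OPTIONS (PATH "<your-storage-path>")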