0
votes

The following HQL works to create a Hive table in HDInsight which I can successfully query. But, I have several questions about WHY it works:

  1. My data rows are, in fact, terminated by carriage return line feed, so why does 'COLLECTION ITEMS TERMINATED BY \002' work? And what is \002 anyway? And no location for the blob is specified so, again, why does this work?

  2. All attempts at creating the same table and specifying "CREATE EXTERNAL TABLE...LOCATION '/user/hive/warehouse/salesorderdetail'" have failed. The table is created but no data is returned. Leave off "external" and don't specify any location and suddenly it works. Wtf?

    CREATE TABLE IF NOT EXISTS default.salesorderdetail(
        SalesOrderID int,
        ProductID int,
        OrderQty int,
        LineTotal decimal
        )
    ROW FORMAT DELIMITED
        FIELDS TERMINATED BY ','
        COLLECTION ITEMS TERMINATED BY '\002'
        MAP KEYS TERMINATED BY '\003'
    STORED AS TEXTFILE

Any insights are greatly appreciated.

UPDATE:Thanks for the help so far. Here's the exact syntax I'm using to attempt external table creation. (I've only changed the storage account name.) I don't see what I'm doing wrong.

drop table default.salesorderdetailx;

CREATE EXTERNAL TABLE default.salesorderdetailx(SalesOrderID int,
       ProductID int,
       OrderQty int,
       LineTotal decimal)
ROW FORMAT DELIMITED
        FIELDS TERMINATED BY ','
        COLLECTION ITEMS TERMINATED BY '\002'
        MAP KEYS TERMINATED BY '\003'
STORED AS TEXTFILE
LOCATION 'wasb://[email protected]/mycn-1/hive/warehouse/salesorderdetailx'
1
By default if you do not list any storage, it assumes you're already talking about the blob storage that the HDInsight cluster is already referencing.Phuc H Duong
To reference other storage, you need to use the following syntax, which is different from normal hadoop clusters. wasb://<containername>@<accountname>.blob.core.windows.net/<path>Phuc H Duong
Thanks! I've actually tried using the "wasb" syntax but still had no luck with it, for whatever reason. Only internal has worked so far. I'm also still wondering why the default collection delimiter of \002 is working. Does "collection" not imply row/tuple as I'm interpreting it? My rows are terminated with CRLF.Jason

1 Answers

2
votes
  1. When you create your cluster in HDInsight, you have to specify underlying blob storage. It assumes that you are referencing that blob storage. You don't need to specific a location because your query is creating an internal table (see answer #2 below) which is created at a default location. External tables need to specify a location in Azure blob storage (outside of the cluster) so that the data in the table is not deleted when the cluster is dropped. See the Hive DDL for more information.

  2. By default, tables are created as internal, and you have to specify the "external" to make them external tables.

    Use EXTERNAL tables when:

    • Data is used outside Hive
    • You need data to be updateable in real time
    • Data is needed when you drop the cluster or the table
    • Hive should not own data and control settings, directories, etc.

Use INTERNAL tables when:

  • You want Hive to manage the data and storage
  • Short term usage (like a temp table)
  • Creating table based on existing table (AS SELECT)

Does the container "user/hive/warehouse/salesorderdetail" exist in your blob storage? That might explain why it is failing for your external table query.