1
votes

I am trying to work out how to explode a complex type in Hive. I have the following Avro file that I want to use for my test and have build a Hive external table over it.

Here is my test data.

{"order_id":123456,"customer_id":987654,"total":305,"order_details":[{"quantity":5,"total":55,"product_detail":{"product_id":1000,"product_name":"Hugo Boss XY","product_description": {"string": "Hugo Xy Men 100 ml"}, "product_status": "AVAILABLE", "product_category":["fragrance","perfume"],"price":10.35,"product_hash":"XY123"}},{"quantity":5,"total":250,"product_detail":{"product_id":2000,"product_name":"Cherokee Polo T Shirt","product_description": {"string": "Cherokee Medium Blue Polo T Shirt"}, "product_status": "AVAILABLE", "product_category":["T-shirts","V-Neck","Cotton", "Medium"],"price":50.00,"product_hash":"XY789"}}]}
{"order_id":789012,"customer_id":4567324,"total":220,"order_details":[{"quantity":10,"total":120,"product_detail":{"product_id":1001,"product_name":"Hugo Men Red","product_description": {"string": "Hugo Men Red 150 ml"}, "product_status": "ONLY_FEW_LEFT", "product_category":["fragrance","perfume"],"price":12.99,"product_hash":"XY456"}},{"quantity":10,"total":100,"product_detail":{"product_id":2001,"product_name":"Ruggers Smart","product_description": {"string": "Ruggers Smart White Small Polo T Shirt"}, "product_status": "ONLY_FEW_LEFT", "product_category":["T-shirts","Round-Neck","Woolen", "Small"],"price":9.99,"product_hash":"XY987"}}]}

Avro schema

{
   "namespace":"com.treselle.db.model",
   "type":"record",
   "doc":"This Schema describes about Order",
   "name":"Order",
   "fields":[
     {"name":"order_id","type": "long"},
     {"name":"customer_id","type": "long"},
     {"name":"total","type": "float"},
     {"name":"order_details","type":{
        "type":"array",
        "items": {
          "namespace":"com.treselle.db.model",
          "name":"OrderDetail",
          "type":"record",
          "fields": [
            {"name":"quantity","type": "int"},
            {"name":"total","type": "float"},
            {"name":"product_detail","type":{
               "namespace":"com.treselle.db.model",
               "type":"record",
               "name":"Product",
               "fields":[
                 {"name":"product_id","type": "long"},
                 {"name":"product_name","type": "string","doc":"This is the name of the product"},
                 {"name":"product_description","type": ["string", "null"], "default": ""},
                 {"name":"product_status","type": {"name":"product_status", "type": "enum", "symbols": ["AVAILABLE", "OUT_OF_STOCK", "ONLY_FEW_LEFT"]}, "default":"AVAILABLE"},
                 {"name":"product_category","type":{"type": "array", "items": "string"}, "doc": "This contains array of categories"},
                 {"name":"price","type": "float"},
                 {"name": "product_hash", "type": {"type": "fixed", "name": "product_hash", "size": 5}}
               ]
             }
            }
          ]
        }
      }
    }
  ]
}

My Hive DDL

CREATE EXTERNAL TABLE orders (
    order_id bigint,
    customer_id bigint,
    total float,
    order_items array<
       struct<
          quantity:int,
          total:float,
          product_detail:struct<
             product_id:bigint,
             product_name:string,
             product_description:string,
             product_status:string,
             product_caretogy:array<string>,
             price:float,
             product_hash:binary
          >
       >
    >
)
STORED AS AVRO
LOCATION '/user/hive/test/orders';

Queries

SELECT order_id, customer_id FROM orders;

This works fine and returns the results from the 2 rows as expected.

But when I try to use explode with lateral view I hit problems.

SELECT
   order_id,
   customer_id,
   ord_dets.quantity as line_qty,
   ord_dets.total as line_total
FROM 
   orders 
   LATERAL VIEW explode(order_items) exploded_table as ord_dets;

This query runs okay, but does not produce any results.

Any pointers as to what it wrong here?

2

2 Answers

0
votes

The reason is that in your schema you defined order_items but in the data and the avro schema the field is called order_details. Hive looks for order_items and thinks it's a non-existent field and defaults to null.

0
votes

Thanks for the pointer.

When I corrected that error I got errors at query time... OK Failed with exception java.io.IOException:org.apache.avro.AvroTypeException: Found com.treselle.db.model.order_details, expecting union

After further analysis I found both the enum type and the fixed type in the avro file caused the "expecting union" error. After removing those columns I was able to query the Hive table successfully.