0
votes

hadoop version: Hadoop 2.6.0-cdh5.12.2 hive version: Hive 1.1.0-cdh5.12.2

Considre two tables: products - stores product Id and other details about the product activity - stores user_id, product_id which tells which user purchased which product and other transaction details.

before creating these tables I added SerDe JAR using below command: add jar /home/ManojKumarM_R/json-serde-1.3-jar-with-dependencies.jar;

CREATE EXTERNAL TABLE IF NOT EXISTS products (id string,name string,reseller 
string,category string,price Double,discount Double,profit_percent Double) 
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' location 
"/user/ManojKumarM_R/ProductsMergeEnrichOut";

sample data in /user/ManojKumarM_R/ProductsMergeEnrichOut

{"Id":"P101", "Name":"Round Tee", "Reseller":"Nike", "Category":"Top Wear", "Price":2195.03, "Discount":21.09, "Profit_percent":23.47}

{"Id":"P102", "Name":"Half Shift", "Reseller":"Nike", "Category":"Top Wear", "Price":1563.84, "Discount":23.83, "Profit_percent":17.12}

CREATE EXTERNAL TABLE IF NOT EXISTS activity (product_id string,user_id 
string,cancellation boolean ,return boolean,cancellation_reason 
string,return_reason string, order_date timestamp, shipment_date timestamp, 
delivery_date timestamp , cancellation_date timestamp,  return_date 
timestamp) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' location 
"/user/ManojKumarM_R/ActivityMergeEnrichOut/";

sample data in /user/ManojKumarM_R/ActivityMergeEnrichOut/

{"Product_id":"P117", "User_id":"U148", "Cancellation":"TRUE", "Return":"NA", "Cancellation_reason":"Duplicate Product", "Return_reason":"NA", "Order_date":"2016-02-12", "Shipment_date":"NA", "Delivery_date":"NA", "Cancellation_date":"2018-05-20", "Return_date":"NA"}

{"Product_id":null, "User_id":"U189", "Cancellation":"FALSE", "Return":"FALSE", "Cancellation_reason":"NA", "Return_reason":"NA", "Order_date":"2017-04-22", "Shipment_date":"2017-05-05", "Delivery_date":"2017-09-09", "Cancellation_date":"NA", "Return_date":"NA"}

table creation was successful,

select * from products; 

&

select * from activity; 

queries work absolutely fine thus denoting that SerDe JAR is picked during select query.

However, when I run below join query: I want to join these two tables on a common column which is Product Id

SELECT a.user_id, p.category FROM activity a JOIN products p  
ON(a.product_id = p.Id);

it fails with below message

Execution log at: /tmp/ManojKumarM_R/ManojKumarM_R_20181010124747_690490ae-e59f-4e9d-9159-5c6a6e28b951.log 2018-10-10 12:47:43 Starting to launch local task to process map join; maximum memory = 2058354688 Execution failed with exit status: 2 Obtaining error information

Task failed! Task ID: Stage-5

Log in /tmp/ManojKumarM_R/ManojKumarM_R_20181010124747_690490ae-e59f-4e9d-9159-5c6a6e28b951.log

2018-10-10 12:47:43,984 ERROR [main]: mr.MapredLocalTask (MapredLocalTask.java:executeInProcess(398)) - Hive Runtime Error: Map local work failed org.apache.hadoop.hive.ql.metadata.HiveException: Failed with exception java.lang.ClassNotFoundException: org.openx.data.jsonserde.JsonSerDejava.lang.RuntimeException: java.lang.ClassNotFoundException: org.openx.data.jsonserde.JsonSerDe at org.apache.hadoop.hive.ql.plan.TableDesc.getDeserializerClass(TableDesc.java:73)

which denotes that Hive is not able to find JsonSerDe JAR even though I've added JAR during that hive session and selct queries were working fine. If anyone has resolved similar issue please let me know, am not sure if Hive looks in different directories for JARs during JOIN operation.

1

1 Answers

0
votes

Hive doesn't invoke MR jobs for all "SELECT *" queries. In your case, the JAR file is not propagated across clusters when actual MR job (JOIN query) being invoked. So, I would recommend you to re-check the JAR folder/file permission or move the file to the HIVE library path and also update the Hive-site.xml. There are couple of previous post on how to add HIVE JAR file and you can check that also.

Previous post.

how to add a jar file in hive