Hive queries fail when the hive.execution.engine is set to MR, they work when set to Tez?

Question

I am using HDP 2.1 sandbox for my work. The version of hive as listed by the jar file is: hive-exec-0.13.0.2.1.1.0-385.jar.

I have created a directory in HDFS having weather information. the actual information is in text files with the following 5 fields (usafid:string,obsdate:string,winddir:int,windspeed:int,visibility:double), E.g. file contents are:

725805 201301010853 70 8 10.0
725805 201301010953 350 6 10.0
725805 201301011053 20 11 10.0
725805 201301011153 20 8 10.0

I am now overlaying a HIVE table using the SQL command

CREATE DATABASE weather;
USE weather;
CREATE EXTERNAL TABLE IF NOT EXISTS wind( 
    usafid     STRING,
    obsdate    STRING,
    winddir    INT,
    windspeed  INT,
   visibility DOUBLE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES  TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/WEATHER/PROCESSED/WIND_RECORDS';

When I run the query SELECT * from wind;, it works fine. But, if I run the query SELECT * from wind WHERE wind = 3;, hive launches a MR job and fails with the following stack trace:

2014-10-29 00:10:58,975 ERROR [IPC Server handler 3 on 52990] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task: attempt_1414566304731_0001_m_000000_0 - exited : java.lang.RuntimeException: java.lang.ArrayIndexOutOfBoundsException: 0
    at org.apache.hadoop.hive.ql.exec.Utilities.getBaseWork(Utilities.java:284)
    at org.apache.hadoop.hive.ql.exec.Utilities.getMapWork(Utilities.java:250)
    at org.apache.hadoop.hive.ql.io.HiveInputFormat.init(HiveInputFormat.java:256)
    at org.apache.hadoop.hive.ql.io.HiveInputFormat.pushProjectionsAndFilters(HiveInputFormat.java:383)
    at org.apache.hadoop.hive.ql.io.HiveInputFormat.pushProjectionsAndFilters(HiveInputFormat.java:376)
    at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getRecordReader(CombineHiveInputFormat.java:552)
    at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.<init>(MapTask.java:168)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:409)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1557)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 0
    at java.beans.XMLDecoder.readObject(XMLDecoder.java:250)
    at org.apache.hadoop.hive.ql.exec.Utilities.deserializeObject(Utilities.java:679)
    at org.apache.hadoop.hive.ql.exec.Utilities.deserializePlan(Utilities.java:622)
    at org.apache.hadoop.hive.ql.exec.Utilities.getBaseWork(Utilities.java:272)

I did a lot of research and digging and was able to trace the error during the parsing of the "Query Plan". Any query with the 'WHERE' clause fails. If I set the execution engine to Tez using the command, the queries run fine.

set hive.execution.engine=tez;

Not sure what is happening and why it is failing when the hive.execution.engine=mr (which I believe is the default)?

EDIT: I setup a 3 node cluster and used Ambari to install and setup HDP 2.1. I am unable to re-create the problem on the 3 node cluster. Looks like the issue manifests itself only in the standalone VM HDP 2.1.

rsantiago rsantiago · Accepted Answer · 2020-11-23T21:38:58

It is true that mr is the default value; however, please keep in mind that Hive 0.13.0 is being deprecated and "It may be removed without further warning". Please consider that Tez and Spark are better options.

On the other hand, the column wind doesn't exist in your schema. My best guess is that the exception ArrayIndexOutOfBoundsException menas that column wind was not found while Tez could ignore its inexistence and complete the query.

Hive queries fail when the hive.execution.engine is set to MR, they work when set to Tez?

1 Answers