In summary, this is what I did:

Original data -> SELECT and save the filtered data to HDFS -> create an external table over the file saved in HDFS -> populate an empty partitioned table from the external table.

Looking at the exception, it seems to have something to do with the output formats of the two tables.

In detail:

1) I have a table `table_log` (in database A) with lots of data and the following structure (3 partition columns):

CREATE TABLE `table_log`(
  `e_id` string, 
  `member_id` string, 
  .
  .
PARTITIONED BY ( 
  `dt` string, 
  `service_type` string, 
  `event_type` string)
ROW FORMAT DELIMITED 
  FIELDS TERMINATED BY '\u0001' 
  COLLECTION ITEMS TERMINATED BY '\u0002' 
  MAP KEYS TERMINATED BY '\u0003' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'

2) I filtered the data by (dt, service_type, event_type) and saved the result to HDFS as follows:

INSERT OVERWRITE DIRECTORY '/user/atscale/filterd-ratlog' SELECT * FROM table_log WHERE dt >= '2016-05-01' AND dt <= '2016-05-31' AND service_type='xxxx_jp' AND event_type='vv';
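Note that INSERT OVERWRITE DIRECTORY writes plain delimited text files, not ORC. On Hive 0.11 and later, the delimiters can also be spelled out explicitly on the export itself; a sketch (path and filters taken from the query above):

```sql
-- Sketch, assuming Hive 0.11+: make the field delimiter of the exported
-- text files explicit so it matches the external table's DDL
INSERT OVERWRITE DIRECTORY '/user/atscale/filterd-ratlog'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\u0001'
SELECT * FROM table_log
WHERE dt >= '2016-05-01' AND dt <= '2016-05-31'
  AND service_type = 'xxxx_jp'
  AND event_type = 'vv';
```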

3) Then I created an external table (table_log_filtered_ext) in database B over the result above. Note that this table has no partitions; the former partition columns dt, service_type, and event_type are plain columns:

DROP TABLE IF EXISTS table_log_filtered_ext;
CREATE EXTERNAL TABLE `table_log_filtered_ext`(
  `e_id` string, 
  `member_id` string, 
  .
  .
  `dt` string,
  `service_type` string,
  `event_type` string)
ROW FORMAT DELIMITED 
  FIELDS TERMINATED BY '\u0001' 
  COLLECTION ITEMS TERMINATED BY '\u0002' 
  MAP KEYS TERMINATED BY '\u0003'
LOCATION '/user/atscale/filterd-ratlog'
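As a sanity check that the external table lines up with the exported files, a quick query like the following can be run first (a sketch; the column names come from the DDL above):

```sql
-- Sketch: verify the external table parses the dumped files as expected,
-- in particular that the trailing dt/service_type/event_type columns line up
SELECT e_id, member_id, dt, service_type, event_type
FROM table_log_filtered_ext
LIMIT 5;
```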

4) I created another new table (table_log_filtered) with the same structure as "table_log" (including the 3 partition columns):

CREATE TABLE `table_log_filtered` (
  `e_id` string, 
  `member_id` string, 
  .
  .
PARTITIONED BY ( 
  `dt` string, 
  `service_type` string, 
  `event_type` string)
ROW FORMAT DELIMITED 
  FIELDS TERMINATED BY '\u0001' 
  COLLECTION ITEMS TERMINATED BY '\u0002' 
  MAP KEYS TERMINATED BY '\u0003' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'

5) Now I wanted to populate "table_log_filtered" (with the 3 partitions, as in "table_log") from the data in the external table "table_log_filtered_ext":

SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.execution.engine=tez; 

INSERT OVERWRITE TABLE table_log_filtered PARTITION(dt, service_type, event_type) 
SELECT * FROM table_log_filtered_ext;

But I get this java.lang.ClassCastException. Looking at the exception, it seems to have something to do with the output formats of the two tables. Any tips?

java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row (tag=0) {"key":{},"value":
.
.
.
      at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:173)
      at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:139)
      at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:344)
      at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:181)
      at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:172)
      at java.security.AccessController.doPrivileged(Native Method)
      at javax.security.auth.Subject.doAs(Subject.java:422)
      at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
      at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:172)
      at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:168)
      at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
      at java.util.concurrent.FutureTask.run(FutureTask.java:266)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
      at java.lang.Thread.run(Thread.java:745)
    Caused by: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row (tag=0
      at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource$GroupIterator.next(ReduceRecordSource.java:370)
      at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource.pushRecord(ReduceRecordSource.java:292)
      ... 16 more
    Caused by: java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.hive.ql.io.orc.OrcSerde$OrcSerdeRow
      at org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat$OrcRecordWriter.write(OrcOutputFormat.java:81)
      at org.apache.hadoop.hive.ql.exec.FileSinkOperator.process(FileSinkOperator.java:753)
      at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:838)
      at org.apache.hadoop.hive.ql.exec.LimitOperator.process(LimitOperator.java:54)
      at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:838)
      at org.apache.hadoop.hive.ql.exec.SelectOperator.process(SelectOperator.java:88)
      at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource$GroupIterator.next(ReduceRecordSource.java:361)
      ... 17 more
Would you please try and check if the error persists with SET hive.execution.engine=mr;? - abhiieor
Yes, a shorter stack trace with MR, but it seems to be the same error. I think this has something to do with the output formats of the two tables? - Ashika Umanga Umagiliya
ORC is a complex columnar format, and the CREATE script should specify only STORED AS ORC. ROW FORMAT DELIMITED simply makes no sense here (it applies only to row formats such as TextFile and SequenceFile), and only masochists use the INPUTFORMAT and OUTPUTFORMAT clauses when the SerDe is fully defined by its alias. - Samson Scharfrichter

1 Answer

In case anyone else bumps into this issue: the fix was, as @Samson Scharfrichter mentioned, to specify STORED AS ORC for table_log_filtered:

CREATE TABLE `table_log_filtered` (
  `e_id` string, 
  `member_id` string, 
  .
  .
PARTITIONED BY ( 
  `dt` string, 
  `service_type` string, 
  `event_type` string)
STORED AS ORC
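To confirm the fix took effect, the table's storage metadata can be inspected; with STORED AS ORC, DESCRIBE FORMATTED should report the ORC SerDe and input/output formats (a sketch):

```sql
-- Sketch: check the storage metadata of the rebuilt table
DESCRIBE FORMATTED table_log_filtered;
-- Expect: SerDe Library: org.apache.hadoop.hive.ql.io.orc.OrcSerde
```

After that, the dynamic-partition INSERT ... SELECT from table_log_filtered_ext should write ORC rows instead of failing with the Text-to-OrcSerdeRow cast.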