1
votes

I have a field of type

array<struct<id:string>>    

in a partitioned parquet table created in hive. When I change the type of this column to add a field in the structure part, I get an error when exploding the array in a certain select. Here are the details:

Create table statement

CREATE EXTERNAL TABLE `test_table`(
  `my_array` array<struct<id:string>>)
PARTITIONED BY ( 
  `ymd` int)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'

Create a partition with data

insert overwrite table test_table
PARTITION (ymd = 20170101)
select
   array(named_struct('id', 'id1')) as my_array

Then I add a new field inside the sructure in the array

ALTER TABLE test_table 
CHANGE COLUMN my_array my_array array<struct<id:string, amount:int>>

This changes the metadata of the table. My expectation is that the past data is still readable, with null values for the 'amount'. Unfortunately I run into an error which I do not understand. For a better illustration let me create a new partition first:

insert overwrite table test_table
PARTITION (ymd = 20170102)
select
array(named_struct('id', 'id2', 'amount',2)) as my_array

Now, doing a

select * from test_table

yields the result which I expected (output from HUE UI):

out put from 'select * from test_table'

However, an error occurs when I want to explode the array with a lateral view this way:

select
    my_array
from
    test_table t
    lateral view explode (my_array) arry as a

This query throws a hive runtime error. The relevant piece of the logs should be the one following this paragraph. A very similar error occurs when selecting 'arry.a' instead of 'my_array'. Surprisingly, the following query runs just fine, with the result which I expected:

select
    ymd,
    a.id,
    a.amount
from
    test_table t
    lateral view explode (my_array) arry as a

enter image description here

To me it looks like this may possibly be a bug. Here is a piece of the log when running the select above which causes the error. Hive version is 1.1.0-cdh5.8.0:

Error: java.lang.RuntimeException: 
org.apache.hadoop.hive.ql.metadata.HiveException: 
Hive Runtime Error while processing row {"my_array":[{"id":"id1","amount":null}],"ymd":20170101} at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:179) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54) at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343) at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693) at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158) 
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row {"my_array":[{"id":"id1","amount":null}],"ymd":20170101} at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:507) at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:170) ... 8 more 
Caused by: java.lang.UnsupportedOperationException: Cannot inspect java.util.ArrayList at org.apache.hadoop.hive.ql.io.parquet.serde.ArrayWritableObjectInspector.getStructFieldsDataAsList(ArrayWritableObjectInspector.java:172) at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serialize(LazySimpleSerDe.java:355) at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serialize(LazySimpleSerDe.java:319) at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serializeField(LazySimpleSerDe.java:258) at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.doSerialize(LazySimpleSerDe.java:242) at org.apache.hadoop.hive.serde2.AbstractEncodingAwareSerDe.serialize(AbstractEncodingAwareSerDe.java:55) at org.apache.hadoop.hive.ql.exec.FileSinkOperator.processOp(FileSinkOperator.java:668) at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:815) at org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:84) at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:815) at org.apache.hadoop.hive.ql.exec.LateralViewJoinOperator.processOp(LateralViewJoinOperator.java:133) at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:815) at org.apache.hadoop.hive.ql.exec.UDTFOperator.forwardUDTFOutput(UDTFOperator.java:125) at org.apache.hadoop.hive.ql.udf.generic.UDTFCollector.collect(UDTFCollector.java:45) at org.apache.hadoop.hive.ql.udf.generic.GenericUDTF.forward(GenericUDTF.java:107) at org.apache.hadoop.hive.ql.udf.generic.GenericUDTFExplode.process(GenericUDTFExplode.java:94) at org.apache.hadoop.hive.ql.exec.UDTFOperator.processOp(UDTFOperator.java:108) at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:815) at org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:84) at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:815) at org.apache.hadoop.hive.ql.exec.LateralViewForwardOperator.processOp(LateralViewForwardOperator.java:37) at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:815) at org.apache.hadoop.hive.ql.exec.TableScanOperator.processOp(TableScanOperator.java:95) at org.apache.hadoop.hive.ql.exec.MapOperator$MapOpCtx.forward(MapOperator.java:157) at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:497) ... 9 more
1

1 Answers

1
votes

We solved this problem by writing a custom SerDe which gives null values for data whenever a column can not be found in the file.