
I have data, some of which includes nested columns (arrays of arrays of objects), saved as Parquet from Spark 2.2.

Now I'm trying to access this data externally with Presto, and I get the following exception whenever I query any nested column.

com.facebook.presto.spi.PrestoException: Error opening Hive split hdfs://name-node/parquet_path/part-00023-8d4f14b1-a3f1-4055-b931-04838701048d-c000.snappy.parquet (offset=0, length=108289): parquet.io.PrimitiveColumnIO cannot be cast to parquet.io.GroupColumnIO
    at com.facebook.presto.hive.parquet.ParquetPageSourceFactory.createParquetPageSource(ParquetPageSourceFactory.java:220)
    at com.facebook.presto.hive.parquet.ParquetPageSourceFactory.createPageSource(ParquetPageSourceFactory.java:115)
    at com.facebook.presto.hive.HivePageSourceProvider.createHivePageSource(HivePageSourceProvider.java:157)
    at com.facebook.presto.hive.HivePageSourceProvider.createPageSource(HivePageSourceProvider.java:93)
    at com.facebook.presto.spi.connector.classloader.ClassLoaderSafeConnectorPageSourceProvider.createPageSource(ClassLoaderSafeConnectorPageSourceProvider.java:44)
    at com.facebook.presto.split.PageSourceManager.createPageSource(PageSourceManager.java:56)
    at com.facebook.presto.operator.TableScanOperator.getOutput(TableScanOperator.java:239)
    at com.facebook.presto.operator.Driver.processInternal(Driver.java:373)
    at com.facebook.presto.operator.Driver.lambda$processFor$8(Driver.java:282)
    at com.facebook.presto.operator.Driver.tryWithLock(Driver.java:672)
    at com.facebook.presto.operator.Driver.processFor(Driver.java:276)
    at com.facebook.presto.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:973)
    at com.facebook.presto.execution.executor.PrioritizedSplitRunner.process(PrioritizedSplitRunner.java:162)
    at com.facebook.presto.execution.executor.TaskExecutor$TaskRunner.run(TaskExecutor.java:477)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassCastException: parquet.io.PrimitiveColumnIO cannot be cast to parquet.io.GroupColumnIO
    at parquet.io.ColumnIOConverter.constructField(ColumnIOConverter.java:56)
    at parquet.io.ColumnIOConverter.constructField(ColumnIOConverter.java:90)
    at com.facebook.presto.hive.parquet.ParquetPageSource.<init>(ParquetPageSource.java:109)

Interestingly, I'm able to query the other, non-nested columns without any issues.

The CREATE TABLE statement looks like the following:

CREATE TABLE hive.tests.table_name (
not_nested_field_1 BIGINT,
not_nested_field_2 BIGINT,
not_nested_field_3 BOOLEAN,
not_nested_field_4 DOUBLE,
not_nested_field_5 ARRAY(VARCHAR),
nested_field_6 ARRAY(ROW(
    nested_level0_field1 BOOLEAN,
    nested_level0_field2 BIGINT,
    nested_level0_field3 BIGINT,
    nested_level0_field4 ARRAY(ROW(
        nested_level1_field1 BOOLEAN,
        nested_level1_field2 BIGINT,
        nested_level1_field3 VARCHAR,
        nested_level1_field4 ARRAY(ROW(
            nested_level2_field1 VARCHAR,
            nested_level2_field2 VARCHAR,
            nested_level2_field3 ARRAY(ROW(
                nested_level3_field1 VARCHAR,
                nested_level3_field2 VARCHAR)))),
        nested_level1_field5 ARRAY(ROW(
            nested_level2_field4 BIGINT,
            nested_level2_field5 BIGINT,
            nested_level2_field6 ARRAY(ROW(
                nested_level3_field3 VARCHAR,
                nested_level3_field4 VARCHAR)))))))))
WITH (
  format = 'PARQUET',
  external_location = 'hdfs://name-node/parquet_path/'
);

I'm using Presto version 0.208, with a local Hive metastore for creating the external tables.

Any help would be appreciated :)


1 Answer


The issue was resolved by setting the hive.parquet.use-column-names=true property in catalog/hive.properties.

By default, Presto uses column indexes (ordinal positions) to access Parquet data, so this property must be set explicitly to make it match Parquet columns by name against the names defined in CREATE TABLE.
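For reference, a minimal Hive catalog file with this property set might look like the following sketch (the connector name and metastore URI are placeholders for your own setup):

```properties
# catalog/hive.properties
connector.name=hive-hadoop2
hive.metastore.uri=thrift://metastore-host:9083

# Match Parquet columns by name rather than by ordinal position
hive.parquet.use-column-names=true
```

Note that catalog properties are read at startup, so the Presto coordinator and workers need to be restarted for the change to take effect.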