
I want to test Presto performance on local TPCH data encoded in Parquet format.

I have the TPCH tables encoded in Parquet stored under the folder /home/data/tpch, and I created a table in Presto as follows:

create table hive.tpch_5.region 
       (regionkey int, name varchar, r_comment varchar) 
        with (format = 'PARQUET', external_location = 'file:///home/data/tpch/');

Selecting regionkey works fine, but selecting name leads to the following error:

Query 20191014_020453_00040_bfdq8 failed: The column name is declared as type string, but the Parquet file declares the column as type INT32

However, the column name is actually BINARY. Here is the output from parquet-tools:

file schema: region 
-------------------------------------------------------------------------------------------------------------------
region_key:  REQUIRED INT32 R:0 D:0
name:        REQUIRED BINARY R:0 D:0
comment:     REQUIRED BINARY R:0 D:0

row group 1: RC:5 TS:712 OFFSET:4 
-------------------------------------------------------------------------------------------------------------------
region_key:   INT32 UNCOMPRESSED DO:0 FPO:4 SZ:43/43/1.00 VC:5 ENC:DELTA_BINARY_PACKED,BIT_PACKED
name:         BINARY UNCOMPRESSED DO:0 FPO:47 SZ:120/120/1.00 VC:5 ENC:DELTA_BYTE_ARRAY,BIT_PACKED
comment:      BINARY UNCOMPRESSED DO:0 FPO:167 SZ:549/549/1.00 VC:5 ENC:DELTA_BYTE_ARRAY,BIT_PACKED

Any help is greatly appreciated! Here is my hive.properties:


connector.name=hive-hadoop2
hive.metastore=file
hive.metastore.catalog.dir=file:///home/harper/presto/hive-catalog
hive.metastore.user=harper
hive.allow-drop-table=true
hive.parquet.use-column-names=true          

Update 10/14

I debugged the Presto server and found the root cause. The error is thrown from ParquetPageSourceFactory.getParquetType, where I found that instead of reading the schema of region, Presto read the schema from lineitem.parquet. It turns out that because I put all TPCH tables under the same directory, Presto did not fetch the file by table name; instead, it assumed all files under that folder belong to the same table.

Solution

Create a separate folder for each table, and move each table's Parquet files into its own directory.
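For example, a minimal sketch of the fix; /tmp/tpch_demo and the .parquet file names here are illustrative stand-ins for the real files under /home/data/tpch:

```shell
# Stand-in for the original flat layout: all table files in one folder.
BASE=/tmp/tpch_demo
mkdir -p "$BASE"
touch "$BASE/region.parquet" "$BASE/lineitem.parquet"

# Give each table its own directory, so Presto maps one table per location.
for t in region lineitem; do
    mkdir -p "$BASE/$t"
    mv "$BASE/$t.parquet" "$BASE/$t/"
done
```

Each table's external_location then points at its own directory, e.g. file:///home/data/tpch/region/ for the region table.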

1 Answer


There has to be some compatibility between Parquet columns types and fields types defined in your DDL. You have used different column names in your DDL, that might be an issue as well. Try changing field names in DDL to the ones defined in Parquet file.