I want to test Presto performance on local TPCH data encoded in Parquet format.
I have the TPCH tables stored as Parquet files under the folder /home/data/tpch, and I create a table in Presto as follows:
create table hive.tpch_5.region
(regionkey int, name varchar, r_comment varchar)
with (format= 'PARQUET', external_location = 'file:///home/data/tpch/');
Selecting regionkey works fine, but selecting name leads to the following error:
Query 20191014_020453_00040_bfdq8 failed: The column name is declared as type string, but the Parquet file declares the column as type INT32
However, the column name
is BINARY, not INT32. Here's the output from parquet-tools:
file schema: region
-------------------------------------------------------------------------------------------------------------------
region_key: REQUIRED INT32 R:0 D:0
name: REQUIRED BINARY R:0 D:0
comment: REQUIRED BINARY R:0 D:0
row group 1: RC:5 TS:712 OFFSET:4
-------------------------------------------------------------------------------------------------------------------
region_key: INT32 UNCOMPRESSED DO:0 FPO:4 SZ:43/43/1.00 VC:5 ENC:DELTA_BINARY_PACKED,BIT_PACKED
name: BINARY UNCOMPRESSED DO:0 FPO:47 SZ:120/120/1.00 VC:5 ENC:DELTA_BYTE_ARRAY,BIT_PACKED
comment: BINARY UNCOMPRESSED DO:0 FPO:167 SZ:549/549/1.00 VC:5 ENC:DELTA_BYTE_ARRAY,BIT_PACKED
Any help is greatly appreciated!
Here's my hive.properties:
connector.name=hive-hadoop2
hive.metastore=file
hive.metastore.catalog.dir=file:///home/harper/presto/hive-catalog
hive.metastore.user=harper
hive.allow-drop-table=true
hive.parquet.use-column-names=true
Update 10/14
I debugged the Presto server and found the root cause. The error was thrown from ParquetPageSourceFactory.getParquetType
, where I found that instead of reading the schema of region, Presto read the schema from lineitem.parquet
. It turns out that because I put all the TPCH tables under the same directory, Presto did not fetch files by table name; instead, it assumed all files under that folder belong to the same table.
Solution
Create a separate folder for each table, and move each table's files into its own directory.
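The reorganization can be sketched as a small shell script. The file names (region.parquet, lineitem.parquet) and the temp base directory are placeholder assumptions; point BASE at your real data directory instead of letting it default to a temp dir:

```shell
# Sketch: move each table's Parquet file into its own per-table directory.
# BASE defaults to a scratch dir with placeholder files for demonstration.
BASE="${BASE:-$(mktemp -d)}"
touch "$BASE/region.parquet" "$BASE/lineitem.parquet"  # simulate the flat layout

for t in region lineitem; do
  mkdir -p "$BASE/$t"            # one directory per table
  mv "$BASE/$t.parquet" "$BASE/$t/"
done

ls -R "$BASE"
```

After reorganizing, point each table's external_location at its own folder, e.g. external_location = 'file:///home/data/tpch/region/' for the region table.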