After unloading the data from Snowflake to S3 in Parquet format, I am creating a Hive external table with the equivalent schema, using the S3 path as the external location, but I am getting the following error:
Failed with exception java.io.IOException:org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1
It looks like the Snowflake unload Spark job does not support the legacy Parquet format. The issue is described here.
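For context, the unload step was along these lines (the stage name and path below are placeholders, not the exact command used):

-- Hypothetical unload command; @my_s3_stage and the path are placeholders
COPY INTO @my_s3_stage/20200626/data/
FROM table_name
FILE_FORMAT = (TYPE = PARQUET)  -- Snappy compression is Snowflake's default for Parquet
OVERWRITE = TRUE;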
Snowflake schema:
create or replace TABLE table_name cluster by ( COL_1, COL_2 )(
    COL_1 VARCHAR(16777216) NOT NULL,
    COL_2 NUMBER(38,0),
    COL_3 VARCHAR(16777216),
    COL_4 VARCHAR(16777216),
    COL_5 VARCHAR(16777216),
    COL_6 NUMBER(38,0),
    COL_7 TIMESTAMP_NTZ(0),
    COL_8 VARCHAR(16777216),
    COL_9 VARCHAR(16777216),
    COL_10 NUMBER(38,0),
    COL_11 NUMBER(38,0)
);
Hive schema:
CREATE EXTERNAL TABLE `DEAL_COUNTRY_MAP`(
    `COL_1` string,
    `COL_2` decimal(38,0),
    `COL_3` string,
    `COL_4` string,
    `COL_5` string,
    `COL_6` decimal(38,0),
    `COL_7` timestamp,
    `COL_8` string,
    `COL_9` string,
    `COL_10` decimal(38,0),
    `COL_11` decimal(38,0))
ROW FORMAT SERDE
    'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
    'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
    'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
    's3a://schemaname.tablename/20200626/data'
TBLPROPERTIES ('parquet.compress'='SNAPPY');
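The CREATE EXTERNAL TABLE statement itself succeeds (the DDL never reads the data files); the exception above is thrown once Hive actually scans the Parquet data, for example with a plain query:

-- Reading the data is what triggers the ParquetDecodingException
SELECT * FROM DEAL_COUNTRY_MAP LIMIT 10;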
In the Hive schema I have also tried varchar instead of string, but that didn't work either; the variant is sketched below.
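For reference, the varchar attempt looked roughly like this (the length is illustrative; Hive caps varchar at 65535):

-- Same DDL with the string columns declared as varchar (length illustrative)
CREATE EXTERNAL TABLE `DEAL_COUNTRY_MAP`(
    `COL_1` varchar(65535),
    `COL_2` decimal(38,0),
    `COL_3` varchar(65535),
    `COL_4` varchar(65535),
    `COL_5` varchar(65535),
    `COL_6` decimal(38,0),
    `COL_7` timestamp,
    `COL_8` varchar(65535),
    `COL_9` varchar(65535),
    `COL_10` decimal(38,0),
    `COL_11` decimal(38,0))
ROW FORMAT SERDE
    'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
    'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
    'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
    's3a://schemaname.tablename/20200626/data'
TBLPROPERTIES ('parquet.compress'='SNAPPY');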