
After unloading the data from Snowflake to S3 in Parquet format, I am creating a Hive table with the equivalent Snowflake schema, with S3 as the external location, but I am getting the following error:

Failed with exception java.io.IOException:org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1

It looks like the Snowflake unload Spark job does not support the legacy Parquet format. The issue is described here.
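
For context, one way to inspect what Snowflake actually wrote (column names and physical types) is to read the files directly with Spark; a minimal PySpark sketch, with the path mirroring the Hive LOCATION below and the session setup only illustrative:

# Minimal inspection sketch; the path mirrors the Hive LOCATION below.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("inspect-snowflake-unload").getOrCreate()

df = spark.read.parquet("s3a://schemaname.tablename/20200626/data")
df.printSchema()   # column names and types as written by Snowflake
df.show(5)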

Snowflake schema:

create or replace TABLE table_name cluster by ( COL_1, COL_2 )(
    COL_1 VARCHAR(16777216) NOT NULL,
    COL_2 NUMBER(38,0),
    COL_3 VARCHAR(16777216),
    COL_4 VARCHAR(16777216),
    COL_5 VARCHAR(16777216),
    COL_6 NUMBER(38,0),
    COL_7 TIMESTAMP_NTZ(0),
    COL_8 VARCHAR(16777216),
    COL_9 VARCHAR(16777216),
    COL_10 NUMBER(38,0),
    COL_11 NUMBER(38,0)
);

Hive schema:

CREATE EXTERNAL TABLE `DEAL_COUNTRY_MAP`(
  `COL_1` string,
  `COL_2` decimal(38,0),
  `COL_3` string,
  `COL_4` string,
  `COL_5` string,
  `COL_6` decimal(38,0),
  `COL_7` timestamp,
  `COL_8` string,
  `COL_9` string,
  `COL_10` decimal(38,0),
  `COL_11` decimal(38,0))
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  's3a://schemaname.tablename/20200626/data'
TBLPROPERTIES ('parquet.compress'='SNAPPY');

In the Hive schema, I have also tried varchar instead of string, and that didn't work either.

The issue that I found is that Snowflake exports the column names in uppercase, whereas Hive uses lowercase column names. – Vishrant
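
One way to sidestep that mismatch, assuming the files are rewritten with a Spark job anyway, is to lowercase the column names before writing them back out, for example:

# Hypothetical follow-up to the inspection sketch above: rename every column to lowercase.
df_lower = df.toDF(*[c.lower() for c in df.columns])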

1 Answer


I am not sure whether Snowflake currently has a solution to generate what is called the "LegacyFormat" of a Parquet file in Spark.

For our business use case, we always run a transforming Spark job after the files are unloaded from Snowflake (as it currently CANNOT dump the timestamp-type columns properly, etc.), so we apply the following transformation logic to the Parquet files dumped from Snowflake before data lake usage (a rough sketch follows the list):

  • Handle the timestamp-type columns
  • Make sure all the column names are transformed to lowercase for AWS Glue
  • Make sure the files are written in the "LegacyFormat"
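
A rough sketch of that job, assuming PySpark; the app name, the paths, and the COL_7 cast are placeholders, and only spark.sql.parquet.writeLegacyFormat is an actual Spark option:

# Rough sketch of the post-unload transform; paths and app name are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("snowflake-unload-transform")                  # placeholder name
    .config("spark.sql.parquet.writeLegacyFormat", "true")  # write Parquet that Hive's older reader understands
    .getOrCreate()
)

df = spark.read.parquet("s3a://schemaname.tablename/20200626/data")

# 1. Handle the timestamp-type columns (COL_7 is the TIMESTAMP_NTZ column in the schema above;
#    the explicit cast is an assumption about how the unload represented it).
df = df.withColumn("COL_7", F.col("COL_7").cast("timestamp"))

# 2. Lowercase all column names so Hive / AWS Glue can match them.
df = df.toDF(*[c.lower() for c in df.columns])

# 3. Write back with the legacy Parquet format (set via the config above), Snappy-compressed.
(
    df.write
      .mode("overwrite")
      .option("compression", "snappy")
      .parquet("s3a://schemaname.tablename/20200626/data_transformed")  # placeholder output path
)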