
I tried to create an external table in Athena from S3 Parquet files (created with Spark and saved to S3). Whenever the data type is decimal or timestamp, no data shows up after

select "column" from athena_table

The code I used to create the external table is as follows:

CREATE EXTERNAL TABLE IF NOT EXISTS tablename (
`date` timestamp, 
`number` decimal(38, 0)
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES (
  'serialization.format' = '1'
)
LOCATION 's3 path'
TBLPROPERTIES (
  'has_encrypted_data'='true'
);

What I have tried:

  1. Replaced decimal(38, 0) in the code above with double; doesn't work, nothing shows up in that column
  2. Replaced decimal(38, 0) or timestamp in the code above with string; doesn't work, garbled characters show up
  3. Updated the data type in my S3 Parquet file to the above two formats (double or string) and resaved the file; doesn't work, I get this error (a rough sketch of the re-save is shown after the error):

    HIVE_BAD_DATA: Field 'number' type FIXED_LEN_BYTE_ARRAY in parquet is incompatible with type double defined in table schema
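For reference, the re-save in step 3 was roughly along these lines (a minimal sketch; the DataFrame name df and the S3 path are placeholders, and each attempt only applied one of the casts):

from pyspark.sql import functions as F

# cast the problem columns before rewriting the Parquet files
# (one attempt cast `number` to double, another cast both columns to string)
recast = df \
    .withColumn("number", F.col("number").cast("double")) \
    .withColumn("date", F.col("date").cast("string"))

recast.write.mode("overwrite").parquet("s3://...")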

It is not advised to use generic column names such as 'date', 'number' or 'string'. They are often reserved words in one language or another. Use more meaningful names such as 'start_date' or 'order_number'. - Guy

1 Answer


Try using AWS Data Wrangler to write the DataFrame as a Parquet file and register it as a table in Glue and Athena:

import awswrangler as wr

# write the DataFrame to S3 as Parquet with Spark
dataframe.write \
        .mode("overwrite") \
        .format("parquet") \
        .partitionBy(["year", "month"]) \
        .save(compression="gzip", path="s3://...")

# register the same data as a Glue/Athena table using the DataFrame's schema
sess = wr.Session(spark_session=spark)
sess.spark.create_glue_table(
    dataframe=dataframe,
    file_format="parquet",
    partition_by=["year", "month"],
    path="s3://...",
    compression="gzip",
    database="my_database")