Using a configuration identical to the one in the Terraform example at https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/glue_catalog_table:
resource "aws_glue_catalog_table" "aws_glue_catalog_table" {
name = "MyCatalogTable"
database_name = "MyCatalogDatabase"
table_type = "EXTERNAL_TABLE"
parameters = {
EXTERNAL = "TRUE"
"parquet.compression" = "SNAPPY"
}
storage_descriptor {
location = "s3://my-bucket/event-streams/my-stream"
input_format = "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat"
output_format = "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat"
ser_de_info {
name = "my-stream"
serialization_library = "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
parameters = {
"serialization.format" = 1
}
}
}
}
and then running a simple Athena query against the created table fails with the error:

Not valid Parquet file
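
The exact query doesn't seem to matter; for example, even something as basic as the following sketch (using the database and table names from the config above) produces the same error:

-- any SELECT against the table fails with "Not valid Parquet file"
SELECT * FROM MyCatalogDatabase.MyCatalogTable LIMIT 10;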
I've tried every SerDe definition listed in the Athena documentation (https://docs.aws.amazon.com/athena/latest/ug/supported-serdes.html) and every input_format I could find, and nothing works.
Trying it with a Parquet file instead of a Snappy file does seem to work, but that doesn't fit my needs. Has anyone ever gotten this working with Snappy files?