1 vote

I have Parquet files in S3 created by different sources. They have the same schema. One set is created using Athena CTAS, and another is created using AWS Glue/Spark.

The files created by Glue look like:

[screenshot of the Glue-written files in S3]

The Athena CTAS ones look like:

[screenshot of the Athena CTAS files in S3]

I tried copying the files from the missing partitions into another folder and running a Glue crawler on that folder, and Glue detects them there. But it cannot seem to detect these partitions when everything is put together. Why is that? Do I need to process all the data using one method for this to work?

Comment: Do Athena and the Glue/Spark job write files to the same partition location? – ya2410

2 Answers

2 votes

If you have added data to a new partition, Glue should detect it as long as the schema matches.

You could try adding the partitions manually with Athena and see if that works. Hopefully it will at least give you a helpful error.

ALTER TABLE orders ADD
  PARTITION (dt = '2016-05-14', country = 'IN') LOCATION 's3://mystorage/path/to/INDIA_14_May_2016'
  PARTITION (dt = '2016-05-15', country = 'IN') LOCATION 's3://mystorage/path/to/INDIA_15_May_2016';

source: https://docs.aws.amazon.com/athena/latest/ug/alter-table-add-partition.html

You could also try loading and printing the schema for both partitions to see if something is off.
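Something like this PySpark sketch would do it; the S3 paths below are just placeholders borrowed from the example above, so swap in your actual partition locations:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder paths for one partition written by each tool
glue_path = "s3://mystorage/path/to/INDIA_14_May_2016"
athena_path = "s3://mystorage/path/to/INDIA_15_May_2016"

# Read each partition's Parquet files directly and print the inferred schemas
glue_df = spark.read.parquet(glue_path)
athena_df = spark.read.parquet(athena_path)
glue_df.printSchema()
athena_df.printSchema()

# Flag any column whose name or type differs between the two
glue_cols = {f.name: f.dataType for f in glue_df.schema.fields}
athena_cols = {f.name: f.dataType for f in athena_df.schema.fields}
for name in sorted(set(glue_cols) | set(athena_cols)):
    if glue_cols.get(name) != athena_cols.get(name):
        print(name, glue_cols.get(name), athena_cols.get(name))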

Without more specifics, e.g. examples of how you are actually partitioning the data, I don't think I can help much more.

You should try to come up with a more reproducible example.

1 vote

OK, I found the issue. There were two main problems:

  • Athena output bigint while Spark output int
  • Some columns had different casing, e.g. countryname vs countryName
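
One way to fix it is to normalize the Spark job's output before writing so it matches the Athena CTAS schema, roughly like this (the column name ordercount and the S3 paths are placeholders for your own):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Placeholder input location for the data produced by the Spark job
df = spark.read.parquet("s3://mystorage/path/to/spark-staging/")

# Cast the mismatched numeric column to bigint (placeholder column name)
df = df.withColumn("ordercount", F.col("ordercount").cast("bigint"))

# Rename all columns to one consistent casing, e.g. lowercase
df = df.toDF(*[c.lower() for c in df.columns])

# Write back partitioned the same way as the Athena CTAS output
# (assumes dt and country columns exist in the data)
df.write.mode("append").partitionBy("dt", "country").parquet("s3://mystorage/path/to/table/")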

One useful tip is to either run printSchema() on each partition and compare the output with diff, or check the table's partitions in the AWS Glue Data Catalog and see how they differ there.
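
For the Data Catalog check, a small boto3 sketch like this can dump the per-partition schemas (the database and table names are placeholders):

import boto3

glue = boto3.client("glue")

# Placeholder database/table names; adjust to your Glue Data Catalog
paginator = glue.get_paginator("get_partitions")
for page in paginator.paginate(DatabaseName="mydatabase", TableName="orders"):
    for partition in page["Partitions"]:
        cols = partition["StorageDescriptor"]["Columns"]
        # Partition values plus the per-partition column schema, which is
        # where bigint/int and column-name casing differences show up
        print(partition["Values"], [(c["Name"], c["Type"]) for c in cols])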