I have multiple parquet files partitioned by day in S3. Those parquet files are all onboarded with 'string' types for each column, which I checked with the following code:

import awswrangler as wr

results = wr.s3.list_objects('s3://datalake/buildings/upload_year=*/upload_month=*/upload_day=*/*.parquet.gzip')
for result in results:
    df = wr.s3.read_parquet(result)
    column_not_string = sum(1 for x in df.dtypes if str(x) != 'string')
    print(column_not_string)

column_not_string is the number of columns in each file whose dtype is not string. Every file prints 0, as expected.

These parquet files are then crawled by an AWS Glue crawler. After I run the crawler, I inspect the newly created table: all data types are 'string'. I inspect the partitions, and every field in the partition properties is also 'string'.

The problem:

When I go to Athena to query the newly created table:

select *
from "default".buildings;

I get the following error:

SQL Error [100071] [HY000]: [Simba]AthenaJDBC An error has been thrown from the AWS Athena client. HIVE_BAD_DATA: Field buildingcode's type INT64 in parquet is incompatible with type string defined in table schema

I don't understand what is going on here. If all the parquet file metadata says string, the files are crawled as strings, and the table is created with string columns, how can there be a type mismatch with INT64?


1 Answer


While poking around in the table partition properties in Glue (AWS Glue > Tables > [your table] > View Partitions > View properties), I noticed that the 'objectCount' property was 1 on some partitions and greater than 1 on others. The partitions with more than one object were also the partitions Athena errored on when I filtered my SQL down to that particular partition.
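You can reproduce this objectCount check directly from an S3 listing (e.g. the keys returned by wr.s3.list_objects) rather than clicking through the console. A minimal sketch, using fabricated keys that mimic the question's partition layout:

```python
import re
from collections import Counter

def objects_per_partition(keys):
    """Count objects under each day-level partition prefix.

    Assumes the layout from the question:
    .../upload_year=YYYY/upload_month=MM/upload_day=DD/<object>
    """
    counts = Counter()
    for key in keys:
        m = re.match(r'(.*upload_day=\d+)/', key)
        if m:
            counts[m.group(1)] += 1
    return counts

# Fabricated example keys; in practice, feed in the list from wr.s3.list_objects.
keys = [
    'datalake/buildings/upload_year=2021/upload_month=01/upload_day=01/part-0.parquet.gzip',
    'datalake/buildings/upload_year=2021/upload_month=01/upload_day=02/part-0.parquet.gzip',
    'datalake/buildings/upload_year=2021/upload_month=01/upload_day=02/stray.parquet.gzip',
]
suspect = {p: n for p, n in objects_per_partition(keys).items() if n > 1}
print(suspect)  # partitions with more than one object
```

Any partition that shows up in `suspect` is worth inspecting object by object.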

I inspected the S3 location for that partition and found multiple objects where I was expecting only one. The extra object was a parquet file with columns that weren't string typed, which is what Athena was tripping over. Lesson learned.
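To pin down which object in a suspect partition is the offender, you can read each object and compare its dtypes against the expected all-string schema. A sketch with fabricated per-object schemas; in practice you would build the `schemas` mapping by calling wr.s3.read_parquet on each key in the partition and taking dict(df.dtypes.astype(str)):

```python
def non_string_columns(dtypes):
    """Return the names of columns whose dtype is not 'string'.

    `dtypes` maps column name -> dtype string, as you would get from
    dict(df.dtypes.astype(str)) after reading one parquet object.
    """
    return [col for col, t in dtypes.items() if t != 'string']

# Fabricated schemas for illustration only.
schemas = {
    'part-0.parquet.gzip': {'buildingcode': 'string', 'name': 'string'},
    'stray.parquet.gzip':  {'buildingcode': 'Int64', 'name': 'string'},
}
for key, dtypes in schemas.items():
    bad = non_string_columns(dtypes)
    if bad:
        print(key, bad)  # the stray file and its non-string columns
```

Deleting or re-writing the flagged object (and re-running the crawler) resolves the HIVE_BAD_DATA error, since Athena validates each file's own parquet schema against the table definition.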