I have multiple parquet files partitioned by day in S3. Those parquet files were all onboarded with the 'string' type for every column, which I checked with the following code:
import awswrangler as wr

results = wr.s3.list_objects('s3://datalake/buildings/upload_year=*/upload_month=*/upload_day=*/*.parquet.gzip')
for result in results:
    df = wr.s3.read_parquet(result)
    # count the columns whose dtype is not 'string'
    column_not_string = sum([1 if str(x) != 'string' else 0 for x in df.dtypes])
    print(column_not_string)
column_not_string is the number of columns in each file whose dtype is not 'string'. Every file prints 0, as expected.
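For reference, the column types recorded in each file's footer can also be checked without loading the data into pandas. This is only a minimal sketch, assuming awswrangler's read_parquet_metadata helper and reusing the results list from the snippet above:

import awswrangler as wr

for result in results:
    # read only the footer metadata; returns the column types as Athena type names
    columns_types, _ = wr.s3.read_parquet_metadata(result)
    print(result, columns_types)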
These parquet files are then crawled by an AWS Glue crawler. After I run the crawler I inspect the newly created table: all data types are 'string'. I also inspect the partitions, and every field in the partition properties is 'string' as well.
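For reference, the same table and partition schemas can also be dumped programmatically. A minimal sketch with boto3's Glue client, assuming the default database and the buildings table used in the query below:

import boto3

glue = boto3.client('glue')

# column types stored on the table itself
table = glue.get_table(DatabaseName='default', Name='buildings')
for col in table['Table']['StorageDescriptor']['Columns']:
    print('table', col['Name'], col['Type'])

# column types stored on each partition
paginator = glue.get_paginator('get_partitions')
for page in paginator.paginate(DatabaseName='default', TableName='buildings'):
    for partition in page['Partitions']:
        for col in partition['StorageDescriptor']['Columns']:
            print(partition['Values'], col['Name'], col['Type'])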
The problem:
When I go to Athena to query the newly created table:
select *
from "default".buildings;
I get the following error:
SQL Error [100071] [HY000]: [Simba]AthenaJDBC An error has been thrown from the AWS Athena client. HIVE_BAD_DATA: Field buildingcode's type INT64 in parquet is incompatible with type string defined in table schema
I don't understand what is going on here. If all of the parquet file metadata reports string types, the files are crawled as strings, and the table is created with string columns, how can there be a type mismatch with an INT64?