I'm creating a dataframe in Spark and I've defined the schema as follows:
from pyspark.sql.types import StructType, StructField, StringType, LongType, DoubleType

SCHEMA = StructType([
    StructField('s3_location', StringType()),
    StructField('partition_date', StringType()),
    StructField('table_name', StringType()),
    StructField('column_name', StringType()),
    StructField('data_type', StringType()),
    StructField('number_of_nulls', LongType()),
    StructField('min', DoubleType()),
    StructField('max', DoubleType()),
    StructField('mean', DoubleType()),
    StructField('variance', DoubleType()),
    StructField('max_length', LongType())
])
I have a bunch of rows that follow this exact schema, and I'm creating the dataframe as follows:
DF = SPARK.createDataFrame(ROWS, schema=SCHEMA)
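For illustration, ROWS is a list of tuples in the same order as the schema; the values below are dummy placeholders, not my real data:

ROWS = [
    # s3_location, partition_date, table_name, column_name, data_type,
    # number_of_nulls, min, max, mean, variance, max_length
    ('s3://my-bucket/data/', '2019-03-01', 'my_table', 'price', 'double', 0, 0.5, 99.9, 42.1, 7.3, None),
    ('s3://my-bucket/data/', '2019-03-01', 'my_table', 'name', 'string', 3, None, None, None, None, 24),
]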
Then I write this dataframe to a CSV file in AWS S3:
DF.repartition(1).write.mode('append').partitionBy('partition_date').csv(SAVE_PATH, header=True)
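Each run appends one CSV file per partition_date under SAVE_PATH, in the usual partition-folder layout, roughly like this (file names are illustrative):

SAVE_PATH/
  partition_date=2019-03-01/
    part-00000-....csv
  partition_date=2019-03-02/
    part-00000-....csv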
This runs successfully and creates the CSV files in S3. However, when I crawl this S3 location with an AWS Glue crawler, it infers the schema differently: every field I declared as DoubleType() is inferred as string instead. As a result, I can't run aggregate functions on those values in tools like QuickSight.
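As a sanity check, this is roughly how I inspect what the crawler inferred (the database and table names here are placeholders):

import boto3

glue = boto3.client('glue')
table = glue.get_table(DatabaseName='my_database', Name='my_crawled_table')
for col in table['Table']['StorageDescriptor']['Columns']:
    print(col['Name'], col['Type'])
# min, max, mean, variance all print as 'string' here, not 'double'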
Why is this happening? Is there a way to fix it?