I'm creating a dataframe in Spark and I've defined the schema as follows:

from pyspark.sql.types import (StructType, StructField, StringType,
                               LongType, DoubleType)

SCHEMA = StructType([StructField('s3_location', StringType()),
                     StructField('partition_date', StringType()),
                     StructField('table_name', StringType()),
                     StructField('column_name', StringType()),
                     StructField('data_type', StringType()),
                     StructField('number_of_nulls', LongType()),
                     StructField('min', DoubleType()),
                     StructField('max', DoubleType()),
                     StructField('mean', DoubleType()),
                     StructField('variance', DoubleType()),
                     StructField('max_length', LongType())])

I have a bunch of rows that follow this exact schema, and I'm creating the dataframe as follows:

DF = SPARK.createDataFrame(ROWS, schema=SCHEMA)

Then I write this dataframe to a CSV file in AWS S3:

DF.repartition(1).write.mode('append').partitionBy('partition_date').csv(
    SAVE_PATH, header=True)

This process is successful and creates the CSV file in S3. Now, I crawl this S3 location in AWS Glue and it infers the schema differently. All the fields I specified as DoubleType() are inferred as string instead. So if I want to run any aggregate functions on these values using something like QuickSight, I can't.

Why is this happening? Is there a way to fix it?

It doesn't seem to have anything to do with Spark. More that your other tools cannot infer a schema from CSV. – Alper t. Turker

2 Answers

A CSV is an untyped file that contains text, i.e. strings.

If you tell AWS Glue that the table contains numeric values, it will read those values as numbers; the problem is that the Glue crawler isn't recognizing your numeric values as such. This could be because you have a header row, or because the values in the file are quoted.

If you create the table in Glue manually, you can specify the data type for each column. Here's how to do that from the Athena console (a scripted alternative follows the list).

  1. Click the vertical ellipsis next to your table name, and select Generate Create Table DDL.
  2. Using the result of this query, modify the data type of your numeric columns in the CREATE TABLE statement; you might use FLOAT, DOUBLE, or DECIMAL.
  3. Drop the table (e.g. DROP TABLE myschema.mytable;)
  4. Run the modified CREATE TABLE statement. It's useful to keep all the table properties that Glue initially added, so that any downstream process continues to recognize the table in the same manner.
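
If you'd rather script this than click through the console, the same type change can be made through the Glue API. Below is a minimal sketch using boto3; the database and table names (my_database, my_table) are placeholders, and the column names come from the schema in the question:

import boto3

glue = boto3.client('glue')

# Placeholder names; substitute your own Glue database and table
table = glue.get_table(DatabaseName='my_database', Name='my_table')['Table']

# Flip the crawler-inferred 'string' columns back to 'double'
for col in table['StorageDescriptor']['Columns']:
    if col['Name'] in ('min', 'max', 'mean', 'variance'):
        col['Type'] = 'double'

# get_table returns read-only fields that update_table rejects,
# so copy only the updatable ones into the TableInput
table_input = {k: v for k, v in table.items()
               if k in ('Name', 'Description', 'Owner', 'Retention',
                        'StorageDescriptor', 'PartitionKeys', 'TableType',
                        'Parameters')}
glue.update_table(DatabaseName='my_database', TableInput=table_input)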

Can you include data types in your file and avoid having to tell Glue about them? Yes: use one of the more structured file formats that Glue supports, such as Parquet (Spark's favourite) or ORC.
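
For example, writing the dataframe from the question as Parquet instead of CSV is a one-line change; the schema travels with the data, so the crawler should pick the DoubleType columns up as double:

# Parquet stores the schema alongside the data, so the Glue crawler
# reads the DoubleType columns as 'double' instead of guessing from text
DF.repartition(1).write.mode('append').partitionBy('partition_date').parquet(SAVE_PATH)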

When crawling CSV files, the crawler infers the column data types on its own. This can be fixed by:

  1. Edit and save the schema.

    After the first crawl, edit the table schema in the Glue console and save it with the necessary data types (here, changing the numeric columns from string to double).

  2. Change the crawler settings.

    Since the schema won't change in future runs, update your crawler's output settings (under "Configuration options (optional)") before the second run, i.e. after fixing the schema; the equivalent API call is sketched after this list.

  3. Select "Ignore the change and don't update the data catalog".

  4. Run the crawler again. It won't show any tables as updated or added, but your data gets populated in the required format.
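
If you manage the crawler programmatically, the same setting can be applied with boto3; the crawler name below is a placeholder, and UpdateBehavior='LOG' is the API counterpart of the console option above:

import boto3

glue = boto3.client('glue')

# 'my_crawler' is a placeholder. UpdateBehavior='LOG' tells the crawler
# to log schema changes instead of updating the data catalog, matching
# "Ignore the change and don't update the data catalog" in the console.
glue.update_crawler(
    Name='my_crawler',
    SchemaChangePolicy={
        'UpdateBehavior': 'LOG',
        'DeleteBehavior': 'LOG',
    },
)
glue.start_crawler(Name='my_crawler')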