I'm creating a dataframe in Spark and I've defined the schema as follows:

from pyspark.sql.types import (StructType, StructField, StringType,
                               LongType, DoubleType)

SCHEMA = StructType([StructField('s3_location', StringType()),
                     StructField('partition_date', StringType()),
                     StructField('table_name', StringType()),
                     StructField('column_name', StringType()),
                     StructField('data_type', StringType()),
                     StructField('number_of_nulls', LongType()),
                     StructField('min', DoubleType()),
                     StructField('max', DoubleType()),
                     StructField('mean', DoubleType()),
                     StructField('variance', DoubleType()),
                     StructField('max_length', LongType())])

I have a bunch of rows that follow this exact schema, and I'm creating the dataframe as follows:

DF = SPARK.createDataFrame(ROWS, schema=SCHEMA)

Then I write this dataframe to a CSV file in AWS S3:

DF.repartition(1).write.mode('append').partitionBy('partition_date').csv(
    SAVE_PATH, header=True)

This process is successful and creates the CSV file in S3. Now, I crawl this S3 location in AWS Glue and it infers the schema differently. All the fields I specified as DoubleType() are inferred as string instead. So if I want to run any aggregate functions on these values using something like QuickSight, I can't.

Why is this happening? Is there a way to fix it?

It doesn't seem to have anything to do with Spark. More that your other tools cannot infer a schema from CSV. – Alper t. Turker

2 Answers

A CSV is an untyped file that contains text, i.e. strings.

If you tell AWS Glue that the table contains numeric values, it will read those values as numbers; the problem is that the Glue crawler isn't recognizing your numeric values as such. This could be because you have a header row, or because the values in the file are quoted.

If you create the table in Glue manually, you can specify the data type for each column. Here's how to do that from the Athena console (a scripted alternative follows the list).

  1. Click the vertical ellipsis next to your table name, and select Generate Create Table DDL.
  2. Using the result of this query, modify the data type of your numeric columns in the CREATE TABLE statement; you might use FLOAT, DOUBLE, or DECIMAL.
  3. Drop the table (e.g. DROP TABLE myschema.mytable;)
  4. Run the modified CREATE TABLE statement. It's useful to keep all the table properties that Glue initially added, so that any downstream process continues to recognize the table in the same manner.
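
If you'd rather script this than click through the console, the same type change can be made through the Glue API. Below is a minimal sketch using boto3; the database and table names (my_database, my_table) are placeholders, and the column names come from the schema in the question:

import boto3

glue = boto3.client('glue')

# Placeholder names; substitute your own Glue database and table
table = glue.get_table(DatabaseName='my_database', Name='my_table')['Table']

# Flip the crawler-inferred 'string' columns back to 'double'
for col in table['StorageDescriptor']['Columns']:
    if col['Name'] in ('min', 'max', 'mean', 'variance'):
        col['Type'] = 'double'

# get_table returns read-only fields that update_table rejects,
# so copy only the updatable ones into the TableInput
table_input = {k: v for k, v in table.items()
               if k in ('Name', 'Description', 'Owner', 'Retention',
                        'StorageDescriptor', 'PartitionKeys', 'TableType',
                        'Parameters')}
glue.update_table(DatabaseName='my_database', TableInput=table_input)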

Can you include data types in your file and avoid having to tell Glue about them? Yes: use one of the more structured file formats that Glue supports, such as Parquet (Spark's favourite) or ORC.
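
For example, writing the dataframe from the question as Parquet instead of CSV is a one-line change; the schema travels with the data, so the crawler should pick the DoubleType columns up as double:

# Parquet stores the schema alongside the data, so the Glue crawler
# reads the DoubleType columns as 'double' instead of guessing from text
DF.repartition(1).write.mode('append').partitionBy('partition_date').parquet(SAVE_PATH)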

When crawling CSV files, the crawler infers the column data types on its own. This can be fixed by:

  1. Edit and save the schema.

    After the first crawl, edit the table schema in the Glue console and save it with the necessary data types (here, changing the numeric columns from string to double).

  2. Change the crawler settings.

    Since the schema won't change in future runs, update your crawler's output settings (under "Configuration options (optional)") before the second run, i.e. after fixing the schema; the equivalent API call is sketched after this list.

  3. Select "Ignore the change and don't update the data catalog".

  4. Run the crawler again. It won't show any tables as updated or added, but your data gets populated in the required format.
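
If you manage the crawler programmatically, the same setting can be applied with boto3; the crawler name below is a placeholder, and UpdateBehavior='LOG' is the API counterpart of the console option above:

import boto3

glue = boto3.client('glue')

# 'my_crawler' is a placeholder. UpdateBehavior='LOG' tells the crawler
# to log schema changes instead of updating the data catalog, matching
# "Ignore the change and don't update the data catalog" in the console.
glue.update_crawler(
    Name='my_crawler',
    SchemaChangePolicy={
        'UpdateBehavior': 'LOG',
        'DeleteBehavior': 'LOG',
    },
)
glue.start_crawler(Name='my_crawler')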