
I have the following structure in Redshift after running DESCRIBE on a table (all fields are nullable):

a integer
b numeric(18)
c date
d char(3)
e smallint
f char(1)
g varchar(20)
h numeric(11,2)

All the data has been extracted to S3. Now I want to load it into a Spark DataFrame, but I need to create a proper schema for this table as well.

What would the Spark schema look like for these fields?

Is the structure below correct? (I'm wondering especially about the numeric(11,2), date, and char(1) fields.)

import org.apache.spark.sql.types._

val schema = StructType(
    Array(
        StructField("a", IntegerType, true),
        StructField("b", IntegerType, true),
        StructField("c", StringType, true),
        StructField("d", StringType, true),
        StructField("e", IntegerType, true),
        StructField("f", StringType, true),
        StructField("g", StringType, true),
        StructField("h", IntegerType, true)
    )
)

1 Answer


You should use:

  • DecimalType (or DoubleType) for fractional values like NUMERIC(11,2). DecimalType is the better choice in my opinion, because it operates on BigDecimals and is exact, which matters for money-like columns.
  • LongType for very large integers like NUMERIC(18). An IntegerType holds only values up to about 2.1 billion, so larger numbers would not be stored properly.
  • DateType for dates. A date can be stored as a string, but if you can, you should choose the more meaningful type.
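Putting these suggestions together, the schema for the table in the question might look like the sketch below. The ShortType for the smallint column is my addition, and whether DateType is parsed directly depends on how the data was unloaded to S3 (you may need to set a date format option on the reader):

```scala
import org.apache.spark.sql.types._

// Suggested mapping of the Redshift columns a..h to Spark types.
// nullable = true everywhere, matching the table definition.
val schema = StructType(
  Array(
    StructField("a", IntegerType, true),       // integer
    StructField("b", LongType, true),          // numeric(18): 18 digits fits in a Long
    StructField("c", DateType, true),          // date
    StructField("d", StringType, true),        // char(3): Spark has no fixed-length char type
    StructField("e", ShortType, true),         // smallint
    StructField("f", StringType, true),        // char(1)
    StructField("g", StringType, true),        // varchar(20)
    StructField("h", DecimalType(11, 2), true) // numeric(11,2): exact decimal
  )
)
```

You would then pass it to the reader, e.g. `spark.read.schema(schema).csv("s3://your-bucket/your-path/")` (bucket and path here are placeholders).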