
I have the following structure in Redshift after running DESCRIBE on a table (all fields are nullable):

a integer
b numeric(18)
c date
d char(3)
e smallint
f char(1)
g varchar(20)
h numeric(11,2)

All the data has been extracted to S3. Now I want to load it into a Spark DataFrame, but I need to create a proper schema for this table as well.

What would the Spark schema look like for these fields?

Is the structure below correct? (I'm wondering especially about the numeric(11,2), date, and char(1) fields.)

import org.apache.spark.sql.types._

val schema = StructType(
    Array(
        StructField("a", IntegerType, true),
        StructField("b", IntegerType, true),
        StructField("c", StringType, true),
        StructField("d", StringType, true),
        StructField("e", IntegerType, true),
        StructField("f", StringType, true),
        StructField("g", StringType, true),
        StructField("h", IntegerType, true)
    )
)

1 Answer


You should use:

  • DecimalType (or DoubleType) for fractional values like NUMERIC(11,2). DecimalType is the better choice in my opinion, because it operates on BigDecimals and is exact, which matters for money-like columns.
  • LongType for very large integers like NUMERIC(18). An IntegerType holds only values up to about 2.1 billion, so larger numbers would not be stored properly.
  • DateType for dates. A date can be stored as a string, but if you can, you should choose the more meaningful type.
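Putting these suggestions together, the schema for the table in the question might look like the sketch below. The ShortType for the smallint column is my addition, and whether DateType is parsed directly depends on how the data was unloaded to S3 (you may need to set a date format option on the reader):

```scala
import org.apache.spark.sql.types._

// Suggested mapping of the Redshift columns a..h to Spark types.
// nullable = true everywhere, matching the table definition.
val schema = StructType(
  Array(
    StructField("a", IntegerType, true),       // integer
    StructField("b", LongType, true),          // numeric(18): 18 digits fits in a Long
    StructField("c", DateType, true),          // date
    StructField("d", StringType, true),        // char(3): Spark has no fixed-length char type
    StructField("e", ShortType, true),         // smallint
    StructField("f", StringType, true),        // char(1)
    StructField("g", StringType, true),        // varchar(20)
    StructField("h", DecimalType(11, 2), true) // numeric(11,2): exact decimal
  )
)
```

You would then pass it to the reader, e.g. `spark.read.schema(schema).csv("s3://your-bucket/your-path/")` (bucket and path here are placeholders).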