2
votes

I am reading a CSV file in Spark using the statement below.

df = spark.read.csv('<CSV FILE>', header=True, inferSchema = True)

When I check the resulting Spark DataFrame, some of the integer and double columns are stored as string columns. However, this is not the case for all of the columns.

I have checked the values of one particular column: all of them are doubles, yet Spark still infers the column as StringType.

Since I am loading a CSV file with around 1000 columns, specifying the schema explicitly is not feasible either.

Any suggestions/help would be highly appreciated.

Regards,

Neeraj

You can just cast it later to your required datatype: df.withColumn("a", col("a").cast(DecimalType(10,2))) or whatever. – philantrovert
Please add sample data for the columns that are not parsed correctly. – T. Gawęda
I can cast the columns explicitly, but to do that I would need to compare all 1000 columns' actual datatypes against the types Spark inferred. I do not want to do this, because I would have to repeat this time-consuming exercise again and again. I can't append the sample data here, but I have checked a column that contains only double values, and Spark is still reading it as a string. My double values have up to 6 decimal places; could that be an issue? If so, how can I specify the number of decimal places while reading the CSV in Spark? – neeraj bhadani
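On the 6-decimal-places question: when casting, the number of decimal places is the scale argument of DecimalType(precision, scale), so something like col("a").cast(DecimalType(16, 6)) would preserve 6 places (the precision 16 here is an assumption; pick it to fit your largest values). As a Spark-free illustration of the scale idea, using Python's stdlib decimal module:

```python
from decimal import Decimal

# In Spark, DecimalType(precision, scale): scale = digits after the point,
# precision = total significant digits. The same idea with stdlib Decimal:
value = Decimal("123.456789")                  # a value with 6 decimal places
rounded = value.quantize(Decimal("0.000001"))  # keep exactly 6 places (scale=6)
print(rounded)  # 123.456789
```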

1 Answer

0
votes

Assuming the schema doesn't change mid-table (floats remain floats through all the rows), you can write a small script to cast the columns automatically:

def isfloat(x):
    try:
        float(x)
    except (TypeError, ValueError):
        return False
    else:
        return True

# Inspect the first row to decide which columns look numeric
line1 = df.head(1)[0]

# Keep the non-float columns as-is and cast the float-looking ones
df = df.select([c for c in df.columns if not isfloat(line1[c])] +
               [df[c].cast("float").alias(c) for c in df.columns if isfloat(line1[c])])
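As a quick sanity check on what the helper classifies (plain Python, no Spark session needed; the sample values are made up):

```python
def isfloat(x):
    """Return True if x parses as a float, False otherwise."""
    try:
        float(x)
    except (TypeError, ValueError):
        return False
    else:
        return True

print(isfloat("3.141593"))  # True  -> the column would be cast to float
print(isfloat("abc"))       # False -> ValueError, column left as string
print(isfloat(None))        # False -> TypeError (e.g. a null in the first row)
```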

If you don't think there is enough information in the first line, you can do the same with the first N lines:

N = 10

def isfloat(c):
    try:
        [float(x) for x in c]
    except (TypeError, ValueError):
        return False
    else:
        return True

# Pull the first N rows into pandas so each column is an iterable of values
Nlines = df.limit(N).toPandas()

df = df.select([c for c in df.columns if not isfloat(Nlines[c])] +
               [df[c].cast("float").alias(c) for c in df.columns if isfloat(Nlines[c])])
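The column-level helper can be exercised the same way; here plain lists stand in for the pandas Series Nlines[c]:

```python
def isfloat(c):
    """Return True only if every value in the column parses as a float."""
    try:
        [float(x) for x in c]
    except (TypeError, ValueError):
        return False
    else:
        return True

print(isfloat(["1.5", "2.25", "3.0"]))  # True  -> all sampled values parse
print(isfloat(["1.5", "n/a", "3.0"]))   # False -> one bad value vetoes the cast
```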