I'm trying to create a DataFrame in Spark by reading a CSV file. The problem is that, unless I do something about it, every column in the DataFrame is typed as string:
|-- ticker: string (nullable = true)
|-- open: string (nullable = true)
|-- close: string (nullable = true)
|-- adj_close: string (nullable = true)
|-- low: string (nullable = true)
|-- high: string (nullable = true)
|-- volume: string (nullable = true)
|-- date: string (nullable = true)
To solve this I set the option "inferSchema" to "true", like this:
val spark = SparkSession.builder
.appName("Job One")
.config("spark.eventLog.enabled", "true")
.config("spark.eventLog.dir", spark_events)
.getOrCreate() // without this, spark is only a Builder, not a SparkSession
import spark.implicits._
val df = spark.read
.option("inferSchema", "true")
.option("header", "true")
.option("mode", "DROPMALFORMED")
.csv(inputPath) // inputPath: placeholder name for the location of the CSV file
With this I get the following schema instead:
|-- ticker: string (nullable = true)
|-- open: double (nullable = true)
|-- close: double (nullable = true)
|-- adj_close: double (nullable = true)
|-- low: double (nullable = true)
|-- high: double (nullable = true)
|-- volume: long (nullable = true)
|-- date: string (nullable = true)
This is what I want, but with the inferSchema option the job takes 1.4 minutes instead of the 6 seconds it takes without it. Another way to get the columns with the types I want is to cast them with withColumn, like this:
val df2 = df
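For reference, here is a minimal sketch of what such a chain of casts could look like. It assumes the column names and target types from the inferred schema above; the helper name castColumns is made up for illustration and is not the exact code from my job.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{DoubleType, LongType}

// Hypothetical helper: takes the all-string DataFrame read without inferSchema
// and casts each column to the type shown in the inferred schema above.
def castColumns(df: DataFrame): DataFrame = {
  // Price columns become double.
  val doubleCols = Seq("open", "close", "adj_close", "low", "high")
  val withDoubles = doubleCols.foldLeft(df) { (acc, name) =>
    // withColumn with an existing name replaces that column in place.
    acc.withColumn(name, col(name).cast(DoubleType))
  }
  // volume becomes long; ticker and date are left as strings.
  withDoubles.withColumn("volume", col("volume").cast(LongType))
}
```

Note that withColumn is a lazy transformation, so calling castColumns(df) only builds a plan; no data is read until an action such as count or show runs.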
This time the whole job takes just 6 seconds again. What gives? Why does inferSchema add so much time when casting the columns by hand does not?
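One experiment that might help narrow this down is to pass the schema to the reader explicitly. As far as I understand, inferSchema has to make an extra pass over the data to decide the column types, while an explicit schema skips that pass. The sketch below assumes the same columns and types as the inferred schema above; the names tickerSchema and readPrices and the path parameter are placeholders.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types._

// Schema mirroring the inferred one shown above (assumed, not taken from the job).
val tickerSchema: StructType = StructType(Seq(
  StructField("ticker", StringType, nullable = true),
  StructField("open", DoubleType, nullable = true),
  StructField("close", DoubleType, nullable = true),
  StructField("adj_close", DoubleType, nullable = true),
  StructField("low", DoubleType, nullable = true),
  StructField("high", DoubleType, nullable = true),
  StructField("volume", LongType, nullable = true),
  StructField("date", StringType, nullable = true)
))

// Hypothetical reader: same options as before, but with the schema supplied
// up front instead of inferSchema.
def readPrices(spark: SparkSession, path: String): DataFrame =
  spark.read
    .schema(tickerSchema)
    .option("header", "true")
    .option("mode", "DROPMALFORMED")
    .csv(path)
```

If reading with this schema is also fast, that would suggest the extra time comes from the inference pass itself rather than from the typed columns.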