0
votes

I have just started learning Spark. I am aware that if we set the inferSchema option to true, the schema is inferred automatically. I am reading a simple CSV file. How do I dynamically infer a schema without specifying any custom schema in my code? The code should be able to build a schema for any incoming dataset.

Is it possible to do so?

I tried using readStream and specified my format as csv, skipping the inferSchema option altogether, but it seems I need to provide a schema in any case.

 val ds1: DataFrame = spark
    .readStream
    .format("csv")
    .load("/home/vaibha/Downloads/C2ImportCalEventSample.csv")
  ds1.show(2)  // show returns Unit, so wrapping it in println only prints "()"


1 Answer

1
votes

You can infer the schema dynamically, but it can get a bit tedious in some cases with the CSV format. More on that here. Referring to the CSV file in your code sample, and assuming it is the same as the one here, something like the below will give you what you need:

scala> val df = spark.read.
 | option("header", "true").
 | option("inferSchema", "true").
 | option("timestampFormat","MM/dd/yyyy").
 | csv("D:\\texts\\C2ImportCalEventSample.csv")

df: org.apache.spark.sql.DataFrame = [Start Date : timestamp, Start Time: string ... 15 more fields]

scala> df.printSchema
root
 |-- Start Date : timestamp (nullable = true)
 |-- Start Time: string (nullable = true)
 |-- End Date: timestamp (nullable = true)
 |-- End Time: string (nullable = true)
 |-- Event Title : string (nullable = true)
 |-- All Day Event: string (nullable = true)
 |-- No End Time: string (nullable = true)
 |-- Event Description: string (nullable = true)
 |-- Contact : string (nullable = true)
 |-- Contact Email: string (nullable = true)
 |-- Contact Phone: string (nullable = true)
 |-- Location: string (nullable = true)
 |-- Category: integer (nullable = true)
 |-- Mandatory: string (nullable = true)
 |-- Registration: string (nullable = true)
 |-- Maximum: integer (nullable = true)
 |-- Last Date To Register: timestamp (nullable = true)
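To apply the same idea to your readStream attempt: Spark's file-based streaming sources require an explicit schema by default, so a common pattern is to infer the schema once with a static batch read and then pass it to the streaming reader. Below is a minimal sketch of that pattern, assuming a spark-shell session (so a SparkSession named spark already exists) and the same sample CSV path as above:

```scala
import org.apache.spark.sql.DataFrame

// 1) Infer the schema once with a static (batch) read.
val staticDf = spark.read.
  option("header", "true").
  option("inferSchema", "true").
  option("timestampFormat", "MM/dd/yyyy").
  csv("D:\\texts\\C2ImportCalEventSample.csv")

// 2) Reuse that inferred schema for the streaming reader,
//    which otherwise demands an explicit schema for file sources.
val streamingDf: DataFrame = spark.readStream.
  option("header", "true").
  schema(staticDf.schema).
  csv("D:\\texts\\C2ImportCalEventSample.csv")
```

Alternatively, you can set spark.sql.streaming.schemaInference to true to let file-based streams infer the schema themselves, at the cost of an extra scan of the input at stream start.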