How can I read a DataFrame with Spark streaming using a schema that I specify?

Question

I'm trying to read a CSV file from AWS S3 into a DataFrame using Spark streaming. However, the data does not land in the intended columns: each row ends up entirely in the first column and the remaining columns are null. How can I read the CSV file so that each field goes into its own column?

I have tried specifying the schema. When I remove it and let Spark infer the schema instead, Spark reports that specifying a schema is mandatory.

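import org.apache.spark.sql.streaming.Trigger
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val schema = StructType(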
  StructField("date", StringType, true) ::
    StructField("close",StringType, true) ::
    StructField("volume", StringType, true) ::
    StructField("open", StringType, true) ::
    StructField("high",StringType,true) ::
    StructField("low", StringType,true) :: Nil)

val ds = spark
  .readStream
  .option("sep", ";")
  .format("csv")
  .option("thousands",",")
  .schema(schema)
  .option("header",true)
  .load(path)

val df = ds.select("*")

df.writeStream.outputMode("append")
  .format("console")
  .trigger(Trigger.ProcessingTime("5 seconds"))
  .start("/home/admin1/IdeaProjects/StockPricePrediction/src/main/output/")
  .awaitTermination()

I expected a DataFrame with data in each column, but the output looks like this:

Batch: 0
-------------------------------------------
19/07/02 18:53:46 INFO CodeGenerator: Code generated in 20.170544 ms
+--------------------+-----+------+----+----+----+
|                date|close|volume|open|high| low|
+--------------------+-----+------+----+----+----+
|0,2019/06/28,1080...| null|  null|null|null|null|
|1,2019/06/27,1076...| null|  null|null|null|null|
|2,2019/06/26,1079...| null|  null|null|null|null|
|3,2019/06/25,1086...| null|  null|null|null|null|
|4,2019/06/24,1115...| null|  null|null|null|null|
+--------------------+-----+------+----+----+----+

Any help will be appreciated. Thank you

Looks like your data is comma-separated, whereas you specified ; as the delimiter. - Abhinay

Wei Chen · Accepted Answer · 2019-07-03T02:30:43

It looks like your delimiter is not set correctly, since every row ends up entirely in the date column. The rows in your output are comma-separated, so set the delimiter to a comma:

.option("delimiter", ",")
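
For context, here is a minimal sketch of how the corrected streaming read might look as a whole. The question's code already sets "sep", and as far as I can tell "sep" takes precedence over its alias "delimiter" in Spark's CSV reader, so it is probably safer to change the existing .option("sep", ";") than to add a second option next to it. The bucket path, application name, and local master below are placeholders. The leading "index" field is an assumption based on the sample rows, which appear to start with 0, 1, 2, ... before the date; drop it if your file has no such column. The "thousands" option is left out because I am not aware of Spark's CSV source supporting it.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger
import org.apache.spark.sql.types.{StringType, StructField, StructType}

object CsvStreamExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("CsvStreamExample") // placeholder name
      .master("local[*]")          // assumption: local test run
      .getOrCreate()

    // Placeholder: directory that new CSV files arrive in
    val path = "s3a://my-bucket/stock-data/"

    // Streaming file sources need an explicit schema by default.
    // "index" is an assumed extra column (see the note above).
    val schema = StructType(Seq(
      StructField("index", StringType, nullable = true),
      StructField("date", StringType, nullable = true),
      StructField("close", StringType, nullable = true),
      StructField("volume", StringType, nullable = true),
      StructField("open", StringType, nullable = true),
      StructField("high", StringType, nullable = true),
      StructField("low", StringType, nullable = true)
    ))

    val df = spark.readStream
      .format("csv")
      .option("sep", ",")       // comma, matching the data shown in the output
      .option("header", "true") // skip the header line of each file
      .schema(schema)
      .load(path)

    // The console sink prints each micro-batch, so no output path is passed.
    df.writeStream
      .outputMode("append")
      .format("console")
      .trigger(Trigger.ProcessingTime("5 seconds"))
      .start()
      .awaitTermination()
  }
}
```

If the columns still look shifted after this change, comparing the number of fields in one raw line of the file with the number of fields in the schema is a reasonable next check.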
