
I'm trying to read a CSV file from AWS S3 into a DataFrame using Spark streaming. However, the data is not split into the desired columns: each row ends up entirely in the first column and the remaining columns are null. How can I read the CSV file so that each field lands in its own column?

I have tried adding a schema. When I remove the schema and try to infer it instead, Spark reports that specifying the schema is mandatory.

val schema = StructType(
  StructField("date", StringType, true) ::
    StructField("close", StringType, true) ::
    StructField("volume", StringType, true) ::
    StructField("open", StringType, true) ::
    StructField("high", StringType, true) ::
    StructField("low", StringType, true) :: Nil)

val ds = spark
  .readStream
  .option("sep", ";")
  .format("csv")
  .option("thousands",",")
  .schema(schema)
  .option("header",true)
  .load(path)

val df = ds.select("*")

df.writeStream.outputMode("append")
  .format("console")
  .trigger(Trigger.ProcessingTime("5 seconds"))
  .start("/home/admin1/IdeaProjects/StockPricePrediction/src/main/output/")
  .awaitTermination()

I expected a DataFrame with data in each column, but the output looks like this:

Batch: 0
-------------------------------------------
19/07/02 18:53:46 INFO CodeGenerator: Code generated in 20.170544 ms
+--------------------+-----+------+----+----+----+
|                date|close|volume|open|high| low|
+--------------------+-----+------+----+----+----+
|0,2019/06/28,1080...| null|  null|null|null|null|
|1,2019/06/27,1076...| null|  null|null|null|null|
|2,2019/06/26,1079...| null|  null|null|null|null|
|3,2019/06/25,1086...| null|  null|null|null|null|
|4,2019/06/24,1115...| null|  null|null|null|null|
+--------------------+-----+------+----+----+----+

Any help will be appreciated. Thank you

What does your input data look like? - Rakshith
It looks like your data is comma-separated (,), whereas you specified ; as the delimiter. - Abhinay

1 Answer


It looks like your delimiter is not set correctly, since every row ends up entirely in the date column. The rows in your output are comma-separated, so set the delimiter to a comma instead of a semicolon:

.option("delimiter", ",")
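
For context, here is a minimal sketch of how the whole reader might look with that change applied. Several things in it are assumptions rather than facts from the question: the object name, the local master, and taking the input path from the first program argument are all hypothetical. The extra "index" field is a guess based on the rows in the output starting with 0, 1, 2, ... before the date; if the file has no such leading column, drop that field. As far as I know, Spark's CSV reader checks "sep" before "delimiter", so the existing .option("sep", ";") line should be replaced rather than left next to the new option. The "thousands" option is left out because I am not aware of Spark's CSV source supporting it.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Hypothetical wrapper object; the question does not show the enclosing code.
object CsvStreamSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("CsvStreamSketch") // hypothetical name
      .master("local[*]")         // assumption: local test run
      .getOrCreate()

    // Assumption: directory containing the CSV files, passed as the first argument.
    val path = args(0)

    // Assumption: each row starts with an index value (0, 1, 2, ...) before the date,
    // as the console output suggests, so it gets its own (hypothetical) "index" field.
    val schema = StructType(Seq(
      StructField("index", StringType, nullable = true),
      StructField("date", StringType, nullable = true),
      StructField("close", StringType, nullable = true),
      StructField("volume", StringType, nullable = true),
      StructField("open", StringType, nullable = true),
      StructField("high", StringType, nullable = true),
      StructField("low", StringType, nullable = true)
    ))

    val ds = spark.readStream
      .format("csv")
      .option("sep", ",")       // comma instead of the original semicolon
      .option("header", "true") // skip the header line of each file
      .schema(schema)           // streaming file sources need an explicit schema by default
      .load(path)

    // Print each micro-batch to the console every 5 seconds.
    ds.writeStream
      .outputMode("append")
      .format("console")
      .trigger(Trigger.ProcessingTime("5 seconds"))
      .start()
      .awaitTermination()
  }
}
```

If the columns still look shifted by one after fixing the delimiter, the field count in the schema probably does not match the number of fields per row, so compare the schema against a raw line from the file.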