I am implementing the buildScan method of the Spark Data Source API v1:
override def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] =
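For context, buildScan lives in a relation class roughly like this (a simplified sketch; MyRelation and its constructor parameters are placeholder names of mine):

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, Filter, PrunedFilteredScan}
import org.apache.spark.sql.types.StructType

// Simplified sketch; MyRelation, paths, and userSchema are placeholders.
class MyRelation(paths: Array[String], userSchema: StructType)
                (@transient val sqlContext: SQLContext)
  extends BaseRelation with PrunedFilteredScan {

  override def schema: StructType = userSchema

  override def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] = {
    // body shown below
    ???
  }
}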
Inside buildScan I read a .csv file with headers:

val df = sqlContext.sparkSession.read
  .schema(_schema_)
  .option("header", "true")
  .csv(_array_of_paths_)
and return it as an RDD:
df.rdd
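So the whole method is essentially this (userSchema and paths are the placeholders from the sketch above):

override def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] = {
  val df = sqlContext.sparkSession.read
    .schema(userSchema)         // the all-string schema shown below
    .option("header", "true")
    .csv(paths: _*)             // expand the array of input paths
  df.rdd                        // requiredColumns and filters are not used yet
}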
The schema is as follows:
root
|-- A: string (nullable = true)
|-- B: string (nullable = true)
|-- C: string (nullable = true)
|-- D: string (nullable = true)
|-- E: string (nullable = true)
|-- F: string (nullable = true)
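All six fields are nullable strings; the _schema_ passed to .schema(...) is built roughly like this (csvSchema is just my name for it here):

import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Matches the printSchema output above: six nullable string columns.
val csvSchema = StructType(
  Seq("A", "B", "C", "D", "E", "F")
    .map(name => StructField(name, StringType, nullable = true))
)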
The header row shows up both as the column names and as the first row of data:
df.show()
+---+---+---+---+---+---+
|  A|  B|  C|  D|  E|  F|
+---+---+---+---+---+---+
|  A|  B|  C|  D|  E|  F|
| a1| b1| c1| d1| e1| f1|
| a2| b2| c2| d2| e2| f2|
| a3| b3| c3| d3| e3| f3|
| a4| b4| c4| d4| e4| f4|
| a5| b5| c5| d5| e5| f5|
+---+---+---+---+---+---+
Once the RDD is returned, df.select("F") or df.select("E") always returns the first column:
+---+
|  A|
+---+
|  A|
| a1|
| a2|
| a3|
| a4|
| a5|
+---+
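For completeness, this is how I hit the problem from the call site (the format name and path are placeholders for my data source and input):

val df = spark.read
  .format("com.example.mycsvsource")  // placeholder for my DefaultSource package
  .load("/path/to/data")              // placeholder path
df.select("F").show()                 // prints the first column's values instead of F's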
but df.show() inside buildScan() returns the correct column.
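That is, inspecting the DataFrame inside buildScan right before returning shows everything lined up:

// inside buildScan, just before df.rdd is returned:
df.show()              // all six columns line up correctly here
df.select("F").show()  // prints f1 ... f5 as expected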
I am not able to find where exactly the column mapping is going wrong.