
I am implementing the buildScan method of the Spark Data Source API v1:

override def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] =
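
For context, buildScan comes from the PrunedFilteredScan trait. A minimal relation skeleton might look as follows (MyCsvRelation, paths, and userSchema are illustrative placeholders, not names from the original post):

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, Filter, PrunedFilteredScan}
import org.apache.spark.sql.types.StructType

// hypothetical relation class; only the buildScan wiring is shown
class MyCsvRelation(val sqlContext: SQLContext, paths: Seq[String], userSchema: StructType)
  extends BaseRelation with PrunedFilteredScan {

  override def schema: StructType = userSchema

  override def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] = {
    // read the CSV files and hand the rows back to Spark
    val df = sqlContext.sparkSession.read
      .schema(userSchema)
      .option("header", "true")
      .csv(paths: _*)
    df.rdd
  }
}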

I am trying to read a .csv file with headers.

val df = sqlContext.sparkSession.read
  .schema(_schema_)
  .option("header", "true")
  .csv(_array_of_paths_)

and returning it as an RDD:

df.rdd
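
(As I understand it, in Data Source API v1 Spark consumes the returned RDD[Row] positionally against requiredColumns, so the rows are expected to contain exactly those columns in the requested order. A minimal sketch of projecting before converting, under that assumption:)

import org.apache.spark.sql.functions.col

// Spark matches the returned rows to requiredColumns by position,
// so project onto exactly those columns, in the requested order
df.select(requiredColumns.map(col): _*).rdd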

The schema is as follows:

root
 |-- A: string (nullable = true)
 |-- B: string (nullable = true)
 |-- C: string (nullable = true)
 |-- D: string (nullable = true)
 |-- E: string (nullable = true)
 |-- F: string (nullable = true)
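
For reference, a schema with these six nullable string columns could be built programmatically like this (a sketch matching the printSchema output above):

import org.apache.spark.sql.types.{StringType, StructField, StructType}

// six nullable string columns: A through F
val schema = StructType(
  Seq("A", "B", "C", "D", "E", "F").map(name => StructField(name, StringType, nullable = true))
)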

The header row appears both as the column names and as the first data row:

df.show()

+---+---+---+---+---+---+
|  A|  B|  C|  D|  E|  F|
+---+---+---+---+---+---+
|  A|  B|  C|  D|  E|  F|
| a1| b1| c1| d1| e1| f1|
| a2| b2| c2| d2| e2| f2|
| a3| b3| c3| d3| e3| f3|
| a4| b4| c4| d4| e4| f4|
| a5| b5| c5| d5| e5| f5|
+---+---+---+---+---+---+

Once the RDD is returned, any select such as

df.select(F) or df.select(E)

always returns the first column:

+---+
|  A|
+---+
|  A|
| a1|
| a2|
| a3|
| a4|
| a5|
+---+

but df.show() inside buildScan() displays the correct columns.

I am not able to find where exactly the column mapping is going wrong.


1 Answer


You have to put the column name in double quotes, like df.select("D").show().
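
For example (col is the equivalent Column-based form; show() is added here just for illustration):

import org.apache.spark.sql.functions.col

df.select("D").show()        // column name as a string literal
df.select(col("D")).show()   // equivalent, using the Column API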