I am learning Scala-Spark and want to know how we can extract the required columns from unordered data based on column name. Details below:

Input Data: RDD[Array[String]]

id=1,country=USA,age=20,name=abc
name=def,country=USA,id=2,age=30
name=ghi,id=3,age=40,country=USA

Required Output:

Name,id
abc,1
def,2
ghi,3

Any help would be much appreciated. Thanks in advance!

1 Answer


Since each row is a single key=value string, you can parse it into a Map and look the columns up by name. (Note that the code below works on an RDD[String], which is what textFile returns, rather than an RDD[Array[String]].)

First, define a case class for the required columns:

case class Data(Name: String, Id: Long)

Then parse each line into the case class:

val parsed = rdd.map { row =>
  // split the line into key=value pairs and build a Map,
  // so each column can be looked up by name regardless of order
  val data = row.split(",").map { pair =>
    val Array(key, value) = pair.split("=", 2)
    key -> value
  }.toMap
  Data(data("name"), data("id").toLong)
}

Convert the result to a DataFrame and display it (show(false) prints values without truncation):

parsed.toDF().show(false)

Output (the column names Name and Id come from the case-class fields):

+----+---+
|Name|Id |
+----+---+
|abc |1  |
|def |2  |
|ghi |3  |
+----+---+
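
If the input can contain malformed pairs or rows missing a key, data("name") or .toLong above will throw at runtime. A minimal, more defensive sketch (reusing the same Data case class; safeParsed is just an illustrative name):

import scala.util.Try

// skip malformed pairs and rows missing the required keys
// instead of failing the whole job with an exception
val safeParsed = rdd.flatMap { row =>
  val data = row.split(",").flatMap { pair =>
    pair.split("=", 2) match {
      case Array(k, v) => Some(k.trim -> v.trim)
      case _           => None // ignore fragments without '='
    }
  }.toMap
  for {
    name <- data.get("name")
    id   <- data.get("id").flatMap(v => Try(v.toLong).toOption)
  } yield Data(name, id)
}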

Here is the full code to read the file:

import org.apache.spark.sql.SparkSession

// enclosing object so that main compiles and runs (the name is arbitrary)
object ExtractColumns {

  case class Data(Name: String, Id: Long)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("xyz").master("local[*]").getOrCreate()

    import spark.implicits._
    val rdd = spark.sparkContext.textFile("path to file ")

    val parsed = rdd.map { row =>
      val data = row.split(",").map { pair =>
        val Array(key, value) = pair.split("=", 2)
        key -> value
      }.toMap
      Data(data("name"), data("id").toLong)
    }

    parsed.toDF().show(false)
  }
}
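
If the set of required columns changes, a variation (just a sketch, reusing spark, rdd and spark.implicits._ from the code above; wanted is an illustrative name) is to keep the parsed Map and select columns by name instead of hardcoding a case class:

val wanted = Seq("name", "id") // edit this list to extract other columns
val selected = rdd
  .map(_.split(",").map(_.split("=", 2)).collect { case Array(k, v) => k -> v }.toMap)
  .map(m => (m.getOrElse(wanted(0), ""), m.getOrElse(wanted(1), "")))
  .toDF(wanted: _*)
selected.show(false)

Missing columns come back as empty strings here; adjust the getOrElse default if you prefer something else.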