
Given input in a file as:

Maths,K1,A1,K2,A2,K3,A4
Physics,L6,M1,L5,M2,L9,M2

Using Spark and Scala, how can I extract key-value pairs as an RDD, as shown below:

Maths, K1
Maths, K2
Maths, K3
Physics, L6
Physics, L5
Physics, L9
Are the inputs two different lists of values or just strings? Are A2, A4, M1... filtered out on purpose? By which criteria? – Lars Skaug

2 Answers


To create a Spark data frame from your data, you can proceed as follows:

// Calling toDF on local collections requires the implicits of an active SparkSession
import spark.implicits._

// If the examples were lists of items
val l1 = List("Maths", "K1", "A1", "K2", "A2", "K3", "A4")

// If they were strings, you can proceed like this
val l2 = "Physics,L6,M1,L5,M2,L9,M2".split(",").toSeq

// toDF() takes a sequence of tuples, which we can now create from our list(s)
val res = l1.tail.map(l1.head -> _).toDF("Subject", "Code")
  .union(l2.tail.map(l2.head -> _).toDF("Subject", "Code"))

// If the filtering in your example was intentional
res.filter("Code not like 'A%' and Code not like 'M%'").show

+-------+----+
|Subject|Code|
+-------+----+
|  Maths|  K1|
|  Maths|  K2|
|  Maths|  K3|
|Physics|  L6|
|Physics|  L5|
|Physics|  L9|
+-------+----+
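Since the question asks for an RDD of key-value pairs rather than a DataFrame, the result above can be converted with `.rdd`. A minimal sketch, assuming the `res` DataFrame from the code above and an active Spark session:

// Convert the filtered DataFrame into an RDD[(String, String)].
// Row.getString(i) reads the i-th column of each Row.
val pairsRdd = res
  .filter("Code not like 'A%' and Code not like 'M%'")
  .rdd
  .map(row => (row.getString(0), row.getString(1)))

// pairsRdd now holds (Maths,K1), (Maths,K2), ..., (Physics,L9)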

Assuming we can safely deduce the expected outcome from the two samples in your question, and assuming the input is a sequence of strings, here is one way to achieve it:

// Calling toDF on a local collection requires the implicits of an active SparkSession
import spark.implicits._

val s = List("Maths,K1,A1,K2,A2,K3,A4", "Physics,L6,M1,L5,M2,L9,M2")
val df = s.flatMap { x =>
  val t = x.split(",")
  // pair the subject (t.head) with every other element, skipping A*/M* codes
  (1 until t.size by 2).map(t.head -> t(_))
}.toDF("C1", "C2")

Result data frame:

+-------+---+
|     C1| C2|
+-------+---+
|  Maths| K1|
|  Maths| K2|
|  Maths| K3|
|Physics| L6|
|Physics| L5|
|Physics| L9|
+-------+---+
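If you want to stay entirely in the RDD API, as the question mentions, the same pairing logic can be applied after parallelizing the input lines. A sketch, assuming a SparkContext `sc` (e.g. `spark.sparkContext`) is available:

// Build an RDD[(String, String)] directly, without going through a DataFrame.
val lines = sc.parallelize(Seq(
  "Maths,K1,A1,K2,A2,K3,A4",
  "Physics,L6,M1,L5,M2,L9,M2"
))

val pairs = lines.flatMap { line =>
  val t = line.split(",")
  // take every second element after the subject: indices 1, 3, 5
  (1 until t.length by 2).map(i => (t.head, t(i)))
}

pairs.collect().foreach(println)

This keeps the subject as the key and drops the A*/M* values by position rather than by prefix, matching the sample output in the question.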