Given input in a file as:
Maths,K1,A1,K2,A2,K3,A4
Physics,L6,M1,L5,M2,L9,M2
Using Spark and Scala, how can I extract key-value pairs as an RDD, as shown below:
Maths, K1
Maths, K2
Maths, K3
Physics, L6
Physics, L5
Physics, L9
To create a Spark DataFrame from your data, you can proceed as follows (this assumes a SparkSession `spark` is in scope):
// If the examples were lists of items
val l1 = List("Maths", "K1", "A1", "K2", "A2", "K3", "A4")
// If they were strings, you can proceed like this
val l2 = "Physics,L6,M1,L5,M2,L9,M2".split(",").toSeq
// toDF() takes a sequence of tuples, which we can now create from our list(s);
// it requires `import spark.implicits._` from the active SparkSession
val res = l1.tail.map(l1.head -> _).toDF("Subject", "Code")
  .union(l2.tail.map(l2.head -> _).toDF("Subject", "Code"))
// If the filtering in your example was intentional
res.filter("Code not like 'A%' and Code not like 'M%'").show
+-------+----+
|Subject|Code|
+-------+----+
| Maths| K1|
| Maths| K2|
| Maths| K3|
|Physics| L6|
|Physics| L5|
|Physics| L9|
+-------+----+
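The head-to-tail pairing and the filter above can be sanity-checked on plain Scala collections, before any SparkSession is involved (a minimal sketch using the question's first row):

```scala
val l1 = List("Maths", "K1", "A1", "K2", "A2", "K3", "A4")

// Pair the subject (head) with every tail element, including the A*/M* values,
// which is why the `not like` filter is needed afterwards
val pairs = l1.tail.map(l1.head -> _)

val kept = pairs.filterNot { case (_, code) =>
  code.startsWith("A") || code.startsWith("M")
}
// kept: List((Maths,K1), (Maths,K2), (Maths,K3))
```

The DataFrame version simply expresses this same filter in SQL syntax instead of a Scala predicate.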
Assuming the expected outcome can be safely deduced from the two samples in your question, and that the input is a sequence of strings, here is one way to achieve it:
val s = List("Maths,K1,A1,K2,A2,K3,A4","Physics,L6,M1,L5,M2,L9,M2")
// `import spark.implicits._` is required for toDF.
// Note: `t(_)` does not work as a placeholder here (it expands inside the tuple),
// so an explicit lambda is used.
val df = s.flatMap { x =>
  val t = x.split(",")
  (1 until t.size by 2).map(i => t.head -> t(i))
}.toDF("C1", "C2")
Resulting DataFrame:
+-------+---+
| C1| C2|
+-------+---+
| Maths| K1|
| Maths| K2|
| Maths| K3|
|Physics| L6|
|Physics| L5|
|Physics| L9|
+-------+---+
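Since the question asks for an RDD rather than a DataFrame, the same extraction function can feed `sparkContext.parallelize` directly; the core logic is plain Scala and can be verified without Spark (the session name `spark` below is an assumption):

```scala
val s = List("Maths,K1,A1,K2,A2,K3,A4", "Physics,L6,M1,L5,M2,L9,M2")

// Core extraction: pair the subject (first field) with every second field after it,
// skipping the A*/M* values in between
def toPairs(line: String): Seq[(String, String)] = {
  val t = line.split(",")
  (1 until t.size by 2).map(i => t.head -> t(i))
}

val pairs = s.flatMap(toPairs)
// pairs: List((Maths,K1), (Maths,K2), (Maths,K3),
//             (Physics,L6), (Physics,L5), (Physics,L9))

// With a SparkSession `spark` in scope, the same function yields the requested RDD:
// val rdd = spark.sparkContext.parallelize(s).flatMap(toPairs)
```

This skips the DataFrame round trip entirely, which matches the RDD output format asked for in the question.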