2
votes

Hi, I am trying to split a column in a Spark RDD.

Data set sample:

twitter data

Here I want to split the Month column into a year and a month. Example:

2019 10

2009 11

Then I want to count all the tweets in a year. (I know how to use reduceByKey(_ + _) here.)

How do I split columns in a Spark RDD? I don't want to use DataFrames.

1
You can use the map function: split the string by position (the year is the first 4 chars, the month is the next two) and return a tuple of (month, year). - Rayan Ral

1 Answer

0
votes

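You can try the following:

// month is assumed to be a "yyyyMM" string: chars 0-3 are the year, chars 4-5 are the month
val rdd = oldRdd.map { case (tokenType, month, count, hashTagName) =>
  (tokenType, month.substring(0, 4), month.substring(4, 6), count, hashTagName)
}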
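
For the per-year count, here is a minimal sketch on plain Scala collections, so it runs without a Spark cluster. It assumes the row layout from the answer above, (tokenType, month, count, hashTagName), with month as a six-character "yyyyMM" string and count as an Int. The names `TweetsPerYear`, `Row`, `splitYearMonth`, and `countPerYear` are made up for illustration, and so are the sample rows.

```scala
object TweetsPerYear {
  // Assumed row layout: (tokenType, month, count, hashTagName),
  // where month is a "yyyyMM" string such as "201910".
  type Row = (String, String, Int, String)

  // Split "yyyyMM" into (year, month) by character position.
  def splitYearMonth(yyyymm: String): (String, String) =
    (yyyymm.substring(0, 4), yyyymm.substring(4, 6))

  // Key each row by its year, then sum the counts per year.
  // On an RDD[Row], the same keying function should work with reduceByKey:
  //   rdd.map { case (_, month, count, _) => (splitYearMonth(month)._1, count) }
  //      .reduceByKey(_ + _)
  def countPerYear(rows: Seq[Row]): Map[String, Int] =
    rows
      .map { case (_, month, count, _) => (splitYearMonth(month)._1, count) }
      .groupBy(_._1)
      .map { case (year, pairs) => year -> pairs.map(_._2).sum }

  def main(args: Array[String]): Unit = {
    // Made-up sample rows, for illustration only.
    val rows = Seq(
      ("hashtag", "201910", 3, "#spark"),
      ("hashtag", "201911", 2, "#scala"),
      ("hashtag", "200911", 5, "#rdd")
    )
    countPerYear(rows).toSeq.sorted.foreach(println)
  }
}
```

If your count column is not numeric yet (for example, it was read from a text file), convert it with something like `count.toInt` before summing.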