I am trying to do some sentiment analysis on review data using a bag of positive and negative words in Apache Spark (using Scala). I am new to Scala, so I need some help. The program is given below:
Read the positive/negative word lists into RDDs:
val pos_words = sc.textFile("D:/spark4/mydata/pos-words.txt")
val neg_words = sc.textFile("D:/spark4/mydata/neg-words.txt")
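Since the word lists should be small (that is an assumption on my part), I was thinking of collecting them into sets on the driver and broadcasting them so they can be looked up inside a map later, roughly like this:

// collect the (assumed small) word lists to the driver and broadcast them as sets
val posSet = sc.broadcast(pos_words.map(_.trim.toLowerCase).collect().toSet)
val negSet = sc.broadcast(neg_words.map(_.trim.toLowerCase).collect().toSet)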
Read the reviews into an RDD:
val dataFile = sc.textFile("D:/spark4/mydata/review_data.txt")
val reviews = dataFile.map(_.replaceAll("[^a-zA-Z\\s]", "").trim().toLowerCase())
Flat-map the reviews into individual words:
val words = reviews.flatMap(_.split(" "))
Now, is there a way I can use pos_words and neg_words within a map function on the reviews RDD and get a count of all the positive and negative words for each record of the reviews RDD? (A rough sketch of what I have in mind is given after the sample output below.)
The desired output would be:
<Review Text 1>,<#PosWordCount>,<#NegWordCount>
xxxxxxxxxxxxxx,20,10
yyyyyyyyyyyyyy,5,30
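Here is a rough sketch of what I have in mind, using the broadcast sets from above (not sure this is idiomatic; it maps over reviews rather than the flat-mapped words, since I need the counts per review):

// for each cleaned review, count the tokens that appear in the positive/negative sets
val reviewCounts = reviews.map { review =>
  val tokens = review.split("\\s+")
  val posCount = tokens.count(w => posSet.value.contains(w))
  val negCount = tokens.count(w => negSet.value.contains(w))
  s"$review,$posCount,$negCount"
}
reviewCounts.take(5).foreach(println)  // quick check of the first few results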
Any help would be greatly appreciated.