
I am trying to do some sentiment analysis of review data using a bag of positive and negative words in Apache Spark (using Scala). I am new to Scala, so I need some help. The program is given below:

Read the positive/negative in RDDs.

val pos_words = sc.textFile("D:/spark4/mydata/pos-words.txt")
val neg_words = sc.textFile("D:/spark4/mydata/neg-words.txt")

Read the reviews into an RDD

val dataFile = sc.textFile("D:/spark4/mydata/review_data.txt")
val reviews = dataFile.map(_.replaceAll("[^a-zA-Z\\s]", "").trim().toLowerCase())

Flatmap the reviews into individual words

val words = reviews.flatMap(_.split(" "))

Now, is there a way I can use pos_words and neg_words within a map function on the words RDD and assign a count of all the positive and negative words to each record of the reviews RDD?

Desired Output would be

<Review Text 1>,<#PosWordCount>,<#NegWordCount>

xxxxxxxxxxxxxx,20,10

yyyyyyyyyyyyyy,5,30

Any help would be greatly appreciated.


1 Answer


To do this, you need to distribute your positive and negative dictionaries to all executors in the cluster. They are likely to be small and fit in memory. I'm assuming your reviews form a much larger RDD that you want to keep distributed. So:

  1. Collect each dictionary to a set on the driver via pos_words.collect().toSet (and likewise for neg_words).
  2. Wrap each set in a broadcast variable with sc.broadcast(...); see the broadcast variables section of the Spark programming guide.
  3. Write a small, plain Scala function that takes a review and the two sets, iterates over the review's words, and keeps a count of positives and negatives, returning the desired tuple. This is basic programming.
  4. Apply the function to all reviews with reviews.map(f).
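The steps above can be sketched as follows. The counting function (step 3) is plain Scala and runnable on its own; the Spark wiring with sc.broadcast and reviews.map is shown in comments because it assumes a live SparkContext and the RDDs from the question. The function name scoreReview is my own, not from the question:

```scala
// Step 3: a plain Scala function that counts positive and negative
// words in one review, given the two dictionary sets. It assumes the
// reviews are already lower-cased and stripped of punctuation, as in
// the question's preprocessing.
def scoreReview(review: String,
                pos: Set[String],
                neg: Set[String]): (String, Int, Int) = {
  val words = review.split("\\s+").filter(_.nonEmpty)
  val posCount = words.count(pos.contains)
  val negCount = words.count(neg.contains)
  (review, posCount, negCount)
}

// Quick local check without Spark:
val pos = Set("good", "great")
val neg = Set("bad", "awful")
println(scoreReview("great food but awful service", pos, neg))
// → (great food but awful service,1,1)

// In Spark (using pos_words, neg_words, and reviews from the question):
//   val posB = sc.broadcast(pos_words.collect().toSet)   // steps 1-2
//   val negB = sc.broadcast(neg_words.collect().toSet)
//   val scored = reviews.map(r => scoreReview(r, posB.value, negB.value))  // step 4
```

Broadcasting the sets means each executor gets one copy of the dictionaries instead of shipping them with every task, and set lookup keeps the per-word check cheap.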

Good luck!