0
votes

I am new to Scala and Spark. I am working on Spark Streaming with Twitter data. I flatMapped the stream into individual words. Now I need to remove words that start with # or @, and words like RT, from the stream before processing it. I know this should be easy. I wrote a filter for it, but it is not working. Can anyone help? My code is:

val sparkConf = new SparkConf().setMaster("local[2]")
val ssc = new StreamingContext(sparkConf, Seconds(2))
val stream = TwitterUtils.createStream(ssc, None)
//val lanFilter = stream.filter(status => status.getLang == "en")
val RDD1 = stream.flatMap(status => status.getText.split(" "))
val filterRDD = RDD1.filter(word =>(word !=word.startsWith("#")))
filterRDD.print()

Also, the language filter (the commented-out line) gives an error.

Thank you.

2
Maybe you could show us the code you wrote, so we can help you better? - Peter Neyens
My code is like this: val sparkConf = new SparkConf().setMaster("local[2]"); val ssc = new StreamingContext(sparkConf, Seconds(2)); val stream = TwitterUtils.createStream(ssc, None); //val lanFilter = stream.filter(status => status.getLang == "en"); val RDD1 = stream.flatMap(status => status.getText.split(" ")); val filterRDD = RDD1.filter(word =>(word !=word.startsWith("#"))) - Naren
Edit your question and add the code in there. Comments have limited markdown support. - Mark
@SNR Please read this so we can actually help you. - Leb

2 Answers

2
votes

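You can use the built-in keyword filter support. Note that these keywords select which tweets the stream returns; they do not remove words from the tweet text: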

TwitterUtils.createStream(ssc, None, Array("filter", "these", "words")) 

But if you want to fix your code:

.filter(word => !word.startsWith("#"))

Regarding language, see this question.

0
votes

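Is your lambda expression correct? The expression word != word.startsWith("#") compares a String with a Boolean, which is always true, so nothing is filtered out. I think you want: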

val filterRDD = RDD1.filter(word => !word.startsWith("#"))
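
The question also asks about words starting with @ and the word RT. A minimal sketch of one predicate that covers all three cases is below. The names TweetWordFilter and keepWord are hypothetical (not from the original post), and the example runs on a plain array of words so it can be tried without a Spark or Twitter setup:

```scala
// Hypothetical helper, not part of the original post.
object TweetWordFilter {
  // Returns true for words that should be kept:
  // non-empty, not a hashtag, not a mention, and not the retweet marker "RT".
  def keepWord(word: String): Boolean =
    word.nonEmpty &&
      !word.startsWith("#") &&
      !word.startsWith("@") &&
      word != "RT"

  def main(args: Array[String]): Unit = {
    // Made-up sample tweet text, for illustration only.
    val tweet = "RT @user: learning #spark streaming with scala"
    val words = tweet.split(" ").filter(keepWord)
    println(words.mkString(" ")) // prints "learning streaming with scala"
  }
}
```

In the streaming job, the same predicate should be usable as RDD1.filter(TweetWordFilter.keepWord), since DStream.filter takes a function from the element type to Boolean.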