I am new to Spark and Spark Streaming. I am working on Twitter streaming data. My task involves dealing with each tweet independently, like counting the number of words in each tweet. From what I have read, each input batch forms one RDD in Spark Streaming. So if I give a batch interval of 2 seconds, then the new RDD contains all the tweets received in those two seconds, and any transformation applied will apply to the whole two seconds of data, with no way to deal with individual tweets within that interval. Is my understanding correct, or does each tweet form a new RDD? I am kind of confused...
1 Answer
In one batch you have an RDD containing all statuses that arrived during the 2-second interval. The RDD is a collection of individual elements (one per tweet), so you can still process each status on its own. Here is a brief example:
JavaDStream<Status> inputDStream = TwitterUtils.createStream(ctx, new OAuthAuthorization(builder.build()), filters);

// foreachRDD is invoked once per batch with that batch's RDD
inputDStream.foreachRDD(new Function2<JavaRDD<Status>, Time, Void>() {
    @Override
    public Void call(JavaRDD<Status> statusRDD, Time time) throws Exception {
        // collect() pulls the whole batch to the driver; fine for a demo or
        // small batches, but for large streams prefer map()/foreachPartition()
        // so the work stays on the executors
        List<Status> statuses = statusRDD.collect();
        for (Status st : statuses) {
            System.out.println("STATUS:" + st.getText() + " user:" + st.getUser().getId());
            // Process and store the status somewhere
        }
        return null;
    }
});
ctx.start();
ctx.awaitTermination();
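Since your concrete task is counting the words in each tweet, note that you usually don't need collect() at all: each element of the batch RDD is one Status, so a plain map() already gives you a per-tweet result. Below is a minimal sketch of just the per-tweet function; the class and method names are illustrative, and the splitting rule (whitespace-separated tokens) is an assumption about what "word" means here.

```java
import java.util.Arrays;
import java.util.List;

public class WordCountPerTweet {

    // Counts whitespace-separated words in one tweet's text.
    // This is the kind of function you would apply per element,
    // e.g. inputDStream.map(st -> countWords(st.getText()))
    static int countWords(String tweetText) {
        String trimmed = tweetText.trim();
        if (trimmed.isEmpty()) {
            return 0;
        }
        return trimmed.split("\\s+").length;
    }

    public static void main(String[] args) {
        List<String> tweets = Arrays.asList(
                "hello spark streaming world",
                "   ",
                "one");
        for (String t : tweets) {
            System.out.println(countWords(t));
        }
    }
}
```

Applied to the stream, `inputDStream.map(...)` with this logic yields a DStream of per-tweet counts, one count per tweet regardless of how many tweets fall into the 2-second batch. (With a pre-Java-8 Spark API you would wrap it in a `Function<Status, Integer>` instead of a lambda.)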
I hope I didn't misunderstand your question.
Zoran