I recently did analysis on a static log file with Spark SQL (find out stuff like the ip addresses which appear more than ten times). The problem was from this site. But I used my own implementation for it. I read the log into an RDD, turned that RDD to a DataFrame (with the help of a POJO) and used DataFrame operations.
Now I'm supposed to do a similar analysis using Spark Streaming for a streaming log file for a window of 30 mins as well as aggregated results for a day. The solution can again be found here but I want to do it another way. So what I've done is this
Use Flume to write data from the log file to an HDFS directory
Use JavaDStream to read the .txt files from HDFS
Then I can't figure out how to proceed. Here's the code I use
Long slide = 10000L; //new batch every 10 seconds
Long window = 1800000L; //30 mins
SparkConf conf = new SparkConf().setAppName("StreamLogAnalyzer");
JavaStreamingContext streamingContext = new JavaStreamingContext(conf, new Duration(slide));
JavaDStream<String> dStream = streamingContext.textFileStream(hdfsPath).window(new Duration(window), new Duration(slide));
Now I can't seem to decide if I should turn each batch to a DataFrame and do what I previously did with the static log file. Or is this way time consuming and overkill.
I'm an absolute noob to Streaming as well as Flume. Could someone please guide me with this?