I am using Spark Streaming with Kafka to receive streaming data.
    val lines = KafkaUtils.createDirectStream[Array[Byte], String, DefaultDecoder, StringDecoder](ssc, kafkaConf, Set(topic)).map(_._2)

I am using this DStream and processing its RDDs:
    lines.foreachRDD(rdd => rdd.foreachPartition { partition =>
      partition.foreach { file => runConfigParser(file) }
    })

runConfigParser is a Java method which parses a file and produces an output that I have to save in HDFS. So multiple nodes will process the RDD and write their output into one single HDFS file, because I want to load this file into Hive.
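Note that foreachRDD returns Unit, so assigning it to a val captures nothing. If I need the parser results back as a distributed collection, a transformation like mapPartitions seems to keep the data on the executors. A minimal sketch, assuming (hypothetically) that runConfigParser returns its result as a String:

    import org.apache.spark.streaming.dstream.DStream

    // Transform instead of foreach, so the parsed output stays a DStream
    // rather than a side effect. Assumes runConfigParser(file: String): String
    // is a static (serialization-safe) Java method; names are illustrative.
    val parsed: DStream[String] = lines.mapPartitions { partition =>
      partition.map(file => runConfigParser(file))
    }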
Should I collect the output of runConfigParser and use sc.parallelize(output).saveAsTextFile(path), so that all my nodes write the RDD outputs to a single HDFS file? Is this design efficient?
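My understanding is that collecting everything to the driver just to call sc.parallelize would create a bottleneck, and HDFS does not allow multiple writers to append to one file anyway; saveAsTextFile also produces a directory of part files rather than a single file. A sketch of the per-batch alternative I am weighing, with illustrative paths:

    // Write each micro-batch to its own HDFS directory, in parallel from the
    // executors. The output path per batch is an assumption.
    parsed.foreachRDD { (rdd, time) =>
      if (!rdd.isEmpty()) {
        rdd.saveAsTextFile(s"hdfs:///user/output/parsed-${time.milliseconds}")
      }
    }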
I will load this single HDFS file (which will be constantly updated, as it is streaming data) into Hive and query it using Impala.
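If per-batch directories turn out to be the right layout, I could point a Hive external table at the parent location and add each batch as a partition, so Impala sees new data after a REFRESH. A rough sketch, where the table name parsed_output and the paths are my own placeholders:

    import org.apache.spark.sql.hive.HiveContext

    // Register an external table over the output location once at startup.
    val hiveContext = new HiveContext(ssc.sparkContext)
    hiveContext.sql(
      """CREATE EXTERNAL TABLE IF NOT EXISTS parsed_output (line STRING)
        |PARTITIONED BY (batch_ts BIGINT)
        |LOCATION 'hdfs:///user/output'""".stripMargin)

    // After each batch is written (e.g. at the end of the foreachRDD above):
    // hiveContext.sql(s"ALTER TABLE parsed_output ADD IF NOT EXISTS PARTITION " +
    //   s"(batch_ts = ${time.milliseconds}) LOCATION 'hdfs:///user/output/parsed-${time.milliseconds}'")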