How can I perform a word count for multiple files in a directory using Apache Spark with Scala?
All of the files are newline-delimited, i.e. one word per line.
The expected output is:
file1.txt,5
file2.txt,6 ...
I tried the following:
val rdd = spark.sparkContext.wholeTextFiles("file:///C:/Datasets/DataFiles/")
val cnt = rdd.map(m => ((m._1, m._2), 1)).reduceByKey((a, b) => a + b)
The output I'm getting instead is:
((file:/C:/Datasets/DataFiles/file1.txt,apple
orange
bag
apple
orange),1)
((file:/C:/Datasets/DataFiles/file2.txt,car
bike
truck
car
bike
truck),1)
I tried sc.textFile() first, but it doesn't give me the filename. wholeTextFiles() returns key-value pairs in which the key is the filename, but I still couldn't get the desired output from it.
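Since wholeTextFiles() gives me (filename, contents) pairs, I was thinking something along the lines of the sketch below might work (splitting each file's contents on newlines and taking the length). The object name and the local master setting are just placeholders from my setup; the path is the same one as above. Is this a reasonable approach, or is there a more idiomatic way?

import org.apache.spark.sql.SparkSession

object WordCountPerFile {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("WordCountPerFile")
      .master("local[*]")
      .getOrCreate()

    // wholeTextFiles returns an RDD of (filePath, fileContents) pairs
    val rdd = spark.sparkContext.wholeTextFiles("file:///C:/Datasets/DataFiles/")

    // Split each file's contents on newlines, drop blank lines, and count the words
    val counts = rdd.map { case (path, contents) =>
      val words = contents.split("\\r?\\n").map(_.trim).filter(_.nonEmpty)
      (path.split("/").last, words.length) // keep only the file name as the key
    }

    // Hopefully prints e.g. "file1.txt,5" and "file2.txt,6"
    counts.collect().foreach { case (file, n) => println(s"$file,$n") }

    spark.stop()
  }
}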