How can I stream files already present in HDFS using Apache Spark?
I have a very specific use case: I have millions of customer records and I want to process them at a customer level using Apache Spark Streaming. Currently I take the entire customer dataset and repartition it on customerId into 100 partitions, ensuring that all records for a given customer land in the same partition and are passed through a single stream.
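A minimal sketch of the repartitioning step I mean, assuming the data is already loaded as a DataFrame with a customerId column (the source path and names here are illustrative, not my actual ones):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("RepartitionByCustomer").getOrCreate()

// hypothetical source path; the real one differs
val customers = spark.read.parquet("hdfs:///tmp/raw-customers")

// 100 partitions keyed by customerId, so all records for one
// customer end up together in the same partition
customers
  .repartition(100, customers("customerId"))
  .write
  .parquet("hdfs:///tmp/dataset")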
All the data is now present in the HDFS location
hdfs:///tmp/dataset
Using this location I want to stream the files, reading the parquet data to get the dataset. I have tried the following, but with no luck.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// start stream
val sparkConf = new SparkConf().setAppName("StreamApp")
// Create the streaming context with a 60-second batch interval
val ssc = new StreamingContext(sparkConf, Seconds(60))
// textFile returns a plain RDD, not a DStream, so nothing is streamed
val rdd = ssc.sparkContext.textFile("hdfs:///tmp/dataset")
println("rdd: " + rdd)
println("rdd count: " + rdd.count())
println("rdd context: " + rdd.context)
ssc.start()
ssc.awaitTermination()
NOTE: This solution doesn't stream data; it just does a one-off batch read from HDFS.
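(For what it's worth, since the files are parquet, a plain batch read would normally go through the DataFrame reader rather than textFile; a minimal sketch, assuming a SparkSession can be created:)
import org.apache.spark.sql.SparkSession

// One-off batch read of the parquet data; no streaming involved
val spark = SparkSession.builder().appName("BatchRead").getOrCreate()
val dataset = spark.read.parquet("hdfs:///tmp/dataset")
println("batch count: " + dataset.count())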
and
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// start stream
val sparkConf = new SparkConf().setAppName("StreamApp")
// Create the streaming context with a 60-second batch interval
val ssc = new StreamingContext(sparkConf, Seconds(60))
// textFileStream only picks up files added to the directory after start
val dstream = ssc.textFileStream("hdfs:///tmp/dataset")
println("dstream: " + dstream)
println("dstream context: " + dstream.context)
// count() on a DStream is a transformation that yields another DStream,
// so print the per-batch counts instead of concatenating it into a string
dstream.count().print()
dstream.print()
ssc.start()
ssc.awaitTermination()
I am always getting 0 results. Is it possible to stream files from HDFS that are already present, when no new files are being published to
hdfs:///tmp/dataset
after the streaming context is started?
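One variant that seems related is the lower-level fileStream API, which takes a newFilesOnly flag. Below is a minimal, untested sketch, assuming the files could be read as text (parquet would need a different InputFormat, and I have not verified that this picks up files older than the remember window):
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val sparkConf = new SparkConf().setAppName("StreamApp")
val ssc = new StreamingContext(sparkConf, Seconds(60))

// Unlike textFileStream, fileStream exposes newFilesOnly; setting it to
// false asks Spark to also consider files already in the directory
val stream = ssc.fileStream[LongWritable, Text, TextInputFormat](
  "hdfs:///tmp/dataset",
  (path: Path) => true, // accept every file
  newFilesOnly = false)

// print per-batch record counts
stream.map(_._2.toString).count().print()

ssc.start()
ssc.awaitTermination()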