I am streaming data into HDFS and trying to read the resulting text files using Spark.
JavaStreamingContext jssc = new JavaStreamingContext(jsc, new Duration(1000));
JavaPairInputDStream<LongWritable, Text> textStream =
    jssc.fileStream("hdfs://myip:9000/travel/FlumeData.[0-9]*",
        LongWritable.class, Text.class, TextInputFormat.class);
While the stream is being written to HDFS, temporary files such as FlumeData.1234.tmp are created. Once all the data has been received, the file is renamed to its final name, e.g. FlumeData.1234.
I want Spark to ignore these .tmp files. I tried the following patterns:
hdfs://myip:9000/travel/FlumeData.[0-9]*
hdfs://myip:9000/travel/FlumeData.//d*
but neither of them works. I am looking for something like this:
jssc.fileStream("hdfs://myip:9000/travel/FlumeData.[0-9]*", LongWritable.class, Text.class, TextInputFormat.class);
where fileStream does not pick up files with the .tmp extension.
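To make the requirement concrete, here is a minimal sketch of the name check I have in mind, using only the JDK. The class and method names (FlumeFileNameFilter, isCompleted, completedOnly) are made up for illustration. I am assuming a check like this could be handed to fileStream as a filter on the file path, but I have not confirmed that my Spark version offers such an overload.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper: decides whether a file name is a completed Flume file.
public class FlumeFileNameFilter {

    // matches() must consume the whole name, so "FlumeData.1234.tmp" is rejected
    // while "FlumeData.1234" is accepted.
    private static final Pattern COMPLETED = Pattern.compile("FlumeData\\.\\d+");

    public static boolean isCompleted(String fileName) {
        return COMPLETED.matcher(fileName).matches();
    }

    // Keeps only the completed files from a list of names.
    public static List<String> completedOnly(List<String> fileNames) {
        return fileNames.stream()
                .filter(FlumeFileNameFilter::isCompleted)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList(
                "FlumeData.1234", "FlumeData.1234.tmp", "FlumeData.5678");
        // prints [FlumeData.1234, FlumeData.5678]
        System.out.println(completedOnly(names));
    }
}
```

The check works on the bare file name (the last path component), not on the full hdfs:// URI, so the name would have to be extracted from the path before calling isCompleted.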
I also tried the following Hadoop code to retrieve the list of files:
private String pathValue(String PathVariable) throws IOException {
    Configuration conf = new Configuration();
    Path path = new Path(PathVariable);
    FileSystem fs = FileSystem.get(path.toUri(), conf);
    System.out.println("PathVariable" + fs.getWorkingDirectory());
    return fs.getName();
}
but the FileSystem object fs does not have a filename() method. Since new files are created at run time, I need to read them as they are created.