0
votes

I have a MapReduce programme that executes correctly locally.

It uses a file called new-positions.csv in the setup() method of the mapper class to populate a hash table in memory:

public void setup(Context context) throws IOException, InterruptedException {
    newPositions = new Hashtable<String, Integer>();
    File file = new File("new-positions.csv");

    Scanner inputStream = new Scanner(file);
    inputStream.nextLine(); // skip the header line
    while (inputStream.hasNext()) {
        String line = inputStream.nextLine();
        String[] splitLine = line.split(",");
        Integer id = Integer.valueOf(splitLine[0].trim());
        // String firstname = splitLine[1].trim();
        // String surname = splitLine[2].trim();

        // Columns 3-6 hold up to four e-mail addresses.
        String[] emails = new String[4];
        for (int i = 3; i < 7; i++) {
            emails[i - 3] = splitLine[i].trim();
        }
        for (String email : emails) {
            if (!email.equals("")) newPositions.put(email, id);
        }
        // String position = splitLine[7].trim();
    }
    inputStream.close(); // close once, after the loop, not inside it
}

The Java programme has been exported to an executable JAR. That JAR and new-positions.csv are both saved in the same directory on our local filesystem.

Then, while inside that directory, we execute the following at the terminal (we have also tried it with the full pathname for new-positions.csv):

hadoop jar MR2.jar Reader2 -files new-positions.csv InputDataset OutputFolder

The job launches fine, but when it gets to the mapper we get:

Error: java.io.FileNotFoundException: new-positions.csv (No such file or directory)

This file definitely exists locally, and we are definitely executing from within that directory.

We are following the guidance given in Hadoop: The Definitive Guide (4th Ed.), p. 274 onwards, and cannot see how our programme and arguments differ in structure.

Could it be something to do with the Hadoop configuration? We know there are workarounds, such as copying the file to HDFS and executing from there, but we need to understand why this "-files" argument isn't working as anticipated.
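For clarity, the HDFS workaround we mean would look roughly like this in the mapper, reading through Hadoop's FileSystem API instead of java.io.File. This is a sketch only; the HDFS path is a placeholder, assuming the file has first been uploaded with hadoop fs -put:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public void setup(Context context) throws IOException, InterruptedException {
        newPositions = new Hashtable<String, Integer>();
        // Placeholder HDFS location for the uploaded copy of the file.
        Path csvPath = new Path("/user/someuser/new-positions.csv");
        FileSystem fs = FileSystem.get(context.getConfiguration());
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(fs.open(csvPath)))) {
            reader.readLine(); // skip the header line
            String line;
            while ((line = reader.readLine()) != null) {
                // ...parse exactly as in the setup() shown above...
            }
        }
    }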

EDIT: Below is some code from the driver class, which may also be the source of the problem:

public int run(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
     if (args.length != 5) {
         printUsage(this, " ");
         return 1;
     }

     Configuration config = getConf();

     FileSystem fs = FileSystem.get(config);

     Job job = Job.getInstance(config);
     job.setJarByClass(this.getClass());
     FileInputFormat.addInputPath(job, new Path(args[3]));

     // Delete old output if necessary
     Path outPath = new Path(args[4]);
     if (fs.exists(outPath)) 
         fs.delete(outPath, true);

     FileOutputFormat.setOutputPath(job, new Path(args[4]));

     job.setInputFormatClass(SequenceFileInputFormat.class);

     job.setOutputKeyClass(NullWritable.class);
     job.setOutputValueClass(Text.class);

     job.setMapOutputKeyClass(EdgeWritable.class);
     job.setMapOutputValueClass(NullWritable.class);

     job.setMapperClass(MailReaderMapper.class);
     job.setReducerClass(MailReaderReducer.class);

     job.setJar("MR2.jar");


     boolean status = job.waitForCompletion(true);
     return status ? 0 : 1;
 }

 public static void main(String[] args) throws Exception {
     int exitCode = ToolRunner.run(new Reader2(), args);
     System.exit(exitCode);
 }

3 Answers

0
votes

Let's assume your new-positions.csv is present in the folder H:/HDP/. Then you need to pass the file as:

file:///H:/HDP/new-positions.csv

You need to qualify the path with file:/// to indicate that it is a local filesystem path. You also need to pass the fully qualified path.

This works perfectly for me.

For example, I pass the local file myini.ini as below:

yarn jar hadoop-mapreduce-examples-2.4.0.2.1.5.0-2060.jar teragen -files "file:///H:/HDP/hadoop-2.4.0.2.1.5.0-2060/share/hadoop/common/myini.ini" -Dmapreduce.job.maps=10 10737418 /usr/teraout/
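Applied to the command in your question, that would look something like the following (the /home/user/FinalProject path is only a placeholder for wherever the file actually lives):

hadoop jar MR2.jar Reader2 -files "file:///home/user/FinalProject/new-positions.csv" InputDataset OutputFolder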

0
votes

I think Manjunath Ballur gave you a correct answer, but the URI you passed, file:///home/local/xxx360/FinalProject/new-positions.csv, may not be resolvable from the Hadoop worker machine.

That path looks like an absolute path on a single machine, but which machine contains /home? Add a server to the path and I think it might work.

Alternatively, if you use the singular -file, it looks like Hadoop will copy the file rather than make a symbolic link as it does with -files.

Please see the documentation here.
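One way to check what the task actually received is to dump the distributed-cache URIs from the mapper's setup() method. A minimal diagnostic sketch (not your code, and only for inspection):

    // Diagnostic sketch: print the URIs registered by -files for this job.
    // "context" is the Mapper.Context passed to setup().
    java.net.URI[] cacheFiles = context.getCacheFiles();
    if (cacheFiles != null) {
        for (java.net.URI uri : cacheFiles) {
            System.err.println("Distributed cache file: " + uri);
        }
    }

If nothing is printed, the -files option probably never reached the job's configuration.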

0
votes

I do not see anything wrong in your code. From my working code, which is technically the same as yours, I also got java.io.FileNotFoundException when I added a hyphen (-) to the file name. Remove the hyphen and try again:

    File file = new File("newpositions.csv");

    hadoop jar MR2.jar Reader2 -files newpositions.csv InputDataset OutputFolder