
My Hadoop version is 2.6.0-cdh5.10.0, and I am using a Cloudera VM.

I am trying to access the HDFS file system from my code so that I can read files and add them as job input or as cache files.

When I access HDFS through the command line, I am able to list the files.

Command:

[cloudera@quickstart java]$ hadoop fs -ls hdfs://localhost:8020/user/cloudera 
Found 5 items
-rw-r--r--   1 cloudera cloudera        106 2017-02-19 15:48 hdfs://localhost:8020/user/cloudera/test
drwxr-xr-x   - cloudera cloudera          0 2017-02-19 15:42 hdfs://localhost:8020/user/cloudera/test_op
drwxr-xr-x   - cloudera cloudera          0 2017-02-19 15:49 hdfs://localhost:8020/user/cloudera/test_op1
drwxr-xr-x   - cloudera cloudera          0 2017-02-19 15:12 hdfs://localhost:8020/user/cloudera/wc_output
drwxr-xr-x   - cloudera cloudera          0 2017-02-19 15:16 hdfs://localhost:8020/user/cloudera/wc_output1

When I try to access the same thing through my MapReduce program, I get a FileNotFoundException. My MapReduce driver configuration code is:

public int run(String[] args) throws Exception {

    Configuration conf = getConf();

    if (args.length != 2) {
        System.err.println("Usage: test <in> <out>");
        System.exit(2);
    }

    ConfigurationUtil.dumpConfigurations(conf, System.out);

    LOG.info("input: " + args[0] + " output: " + args[1]);

    Job job = Job.getInstance(conf);

    job.setJobName("test");

    job.setJarByClass(Driver.class);
    job.setMapperClass(Mapper.class);
    job.setReducerClass(Reducer.class);

    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(Text.class);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(DoubleWritable.class);

    job.addCacheFile(new Path("hdfs://localhost:8020/user/cloudera/test/test.tsv").toUri());

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    boolean result = job.waitForCompletion(true);
    return (result) ? 0 : 1;
}

The job.addCacheFile line in the above snippet results in a FileNotFoundException.

2) My second question is:

My core-site.xml entry points to localhost:9000 as the default HDFS filesystem URI. However, from the command prompt I can only reach the default HDFS filesystem on port 8020, not on port 9000; when I tried port 9000, I ended up with a ConnectionRefused exception. I am not sure where the configuration is being read from.

My core-site.xml is as follows:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <!--  
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/Users/student/tmp/hadoop-local/tmp</value>
   <description>A base for other temporary directories.</description>
  </property>
-->
  
 <property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:9000</value>
  <description>Default file system URI.  URI:scheme://authority/path scheme:method of access authority:host,port etc.</description>
</property>
 
</configuration>

My hdfs-site.xml is as follows:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

	<property>
		<name>dfs.name.dir</name>
		<value>/tmp/hdfs/name</value>
		<description>Determines where on the local filesystem the DFS name
			node should store the name table(fsimage).</description>
	</property>

	<property>
		<name>dfs.data.dir</name>
		<value>/tmp/hdfs/data</value>
		<description>Determines where on the local filesystem an DFS data node should store its blocks.</description>
	</property>
	
	<property>
		<name>dfs.replication</name>
		<value>1</value>
		<description>Default block replication. Usually 3, 1 in our case
		</description>
	</property>
</configuration>

I am receiving the following exception:

java.io.FileNotFoundException: hdfs:/localhost:8020/user/cloudera/test/ (No such file or directory)
    at java.io.FileInputStream.open(Native Method)
    at java.io.FileInputStream.<init>(FileInputStream.java:146)
    at java.io.FileInputStream.<init>(FileInputStream.java:101)
    at java.io.FileReader.<init>(FileReader.java:58)
    at hadoop.TestDriver$ActorWeightReducer.setup(TestDriver.java:104)
    at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:168)
    at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:627)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:389)
    at org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:319)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

Any help will be useful!

Comments:

siddhartha jain: Can you share the argument you are giving when you try to access the file through MapReduce?

user1477232: @siddhartha jain: hadoop test.jar path-to-driverclass hdfs-path-to-input output

Hari Singh: Can you post the exception that is thrown by the program?

user1477232: @Hari Singh: I have updated the post with the exception I am receiving.

Hari Singh: @user1477232 If you look at the logs, it is trying to read from hdfs:/localhost:8020/user/cloudera/test/, but I think it should be hdfs://localhost:8020/user/cloudera/test/. So give three slashes (hdfs:///localhost:8020/), or don't give the full path at all and just write /user/cloudera/test; by default it will resolve it against HDFS.

1 Answer


You are not required to give the full URI as the argument for accessing the file from HDFS. The client adds the hdfs://host:port prefix on its own, based on the default filesystem configured in core-site.xml. You just need to mention the file you want to access along with its directory structure, which in your case should be /user/cloudera/test.
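
For illustration, here is a minimal sketch of that approach (the class and method names below are made up for the example, not taken from the post): add the cache file with a path relative to the default filesystem, then read the localized copy by its plain file name in the reducer's setup().

// Sketch only: CacheFileSketch, addLookupFile and LookupReducer are hypothetical names.
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;

public class CacheFileSketch {

    // Driver side: no scheme or authority; the client resolves the path
    // against the default filesystem from core-site.xml.
    static void addLookupFile(Job job) {
        job.addCacheFile(new Path("/user/cloudera/test/test.tsv").toUri());
    }

    // Reducer side: the framework localizes the cached file into the task's
    // working directory, so it can be opened by its plain file name.
    public static class LookupReducer extends Reducer<Text, Text, Text, DoubleWritable> {

        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            URI[] cacheFiles = context.getCacheFiles();
            if (cacheFiles != null && cacheFiles.length > 0) {
                try (BufferedReader reader = new BufferedReader(new FileReader("test.tsv"))) {
                    String line;
                    while ((line = reader.readLine()) != null) {
                        // build the lookup structure here
                    }
                }
            }
        }
    }
}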

Coming to your second question: port 8020 is the default NameNode port for HDFS. That is why you are able to access HDFS on port 8020 even though you did not configure it. The ConnectionRefused exception occurs because HDFS is actually started on 8020, so nothing is listening for requests on port 9000 and the connection is refused.
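
As a quick check of which configuration the client is actually loading (a small sketch, not part of the original answer; the class name is made up), you can print the resolved default filesystem URI:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class DefaultFsCheck {
    public static void main(String[] args) throws Exception {
        // Loads core-site.xml / hdfs-site.xml from whatever is on the classpath
        Configuration conf = new Configuration();
        // fs.defaultFS is the current property name; fs.default.name is its deprecated alias
        System.out.println("fs.defaultFS = " + conf.get("fs.defaultFS"));
        System.out.println("Resolved to  = " + FileSystem.get(conf).getUri());
    }
}

If the printed URI does not match your edited core-site.xml, the client is reading a different copy of the configuration file.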

Refer here for more details about the default ports.