0 votes

I have set up a fully distributed Hadoop cluster with Apache Hive on it, and I am loading data into Hive tables from Java code. The replication factor in hdfs-site.xml is 2. When I copy files to HDFS with hadoop fs -put, the files show 2 replicas, but the files loaded into Hive tables show 3 replicas.

Is there a separate replication parameter that needs to be set for files loaded into Hive?

2
Can you check the replication of other files in the cluster? - Amal G Jose

2 Answers

0 votes

To control the replication factor of a table's files while loading it into Hive, set the following property in the Hive client session before running the load:

SET dfs.replication=2;
LOAD DATA LOCAL ......;
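
Since the question mentions loading from Java code, the same session property can also be set through the Hive JDBC driver before issuing the LOAD. The following is only a rough sketch, not part of the answer above: the HiveServer2 URL, credentials, local path, and table name are placeholders, and the hive-jdbc driver must be on the classpath.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// Placeholder HiveServer2 URL, user and password; adjust for your cluster
Connection conn = DriverManager.getConnection("jdbc:hive2://mycluster:10000/default", "hive", "");
Statement stmt = conn.createStatement();
// Applies only to files written by this Hive session
stmt.execute("SET dfs.replication=2");
// Placeholder local path and table name
stmt.execute("LOAD DATA LOCAL INPATH '/tmp/testfile.txt' INTO TABLE my_table");
stmt.close();
conn.close();

Note that when going through HiveServer2, LOCAL refers to the filesystem of the HiveServer2 host, not the machine running the JDBC client.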
0 votes

Finally, I was able to find the reason for this behaviour.

Before loading the file into the table, I used to copy it from the local machine to HDFS with:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

Configuration config = new Configuration();
config.set("fs.defaultFS", "hdfs://mycluster:8020");
FileSystem dfs = FileSystem.get(config);
// Copy the local file into the user's HDFS working directory
Path src = new Path("D:\\testfile.txt");
Path dst = new Path(dfs.getWorkingDirectory() + "/testffileinHDFS.txt");
dfs.copyFromLocalFile(src, dst);

The copyFromLocalFile() call was writing the file with 3 replicas by default, even though I had set the replication factor to 2 in hdfs-site.xml. The most likely reason is that dfs.replication is applied by the HDFS client at write time, and since the cluster's hdfs-site.xml was not on the client's classpath, the Configuration fell back to the client-side default of 3.
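
To confirm what replication a file actually ended up with (as asked in the comment above), the value recorded by the NameNode can be read back from its FileStatus; this is a small sketch reusing the same cluster URI and destination path as the snippet above:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

Configuration config = new Configuration();
config.set("fs.defaultFS", "hdfs://mycluster:8020");
FileSystem dfs = FileSystem.get(config);
// Read back the replication factor recorded for the uploaded file
FileStatus status = dfs.getFileStatus(new Path(dfs.getWorkingDirectory() + "/testffileinHDFS.txt"));
System.out.println("Replication factor: " + status.getReplication());

From the shell, hadoop fs -stat %r <path> prints the same value.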

After explicitly specifying the replication factor in the code, as follows:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

Configuration config = new Configuration();
config.set("fs.defaultFS", "hdfs://mycluster:8020");
config.set("dfs.replication", "1");  // replication factor specified here
FileSystem dfs = FileSystem.get(config);
Path src = new Path("D:\\testfile.txt");
Path dst = new Path(dfs.getWorkingDirectory() + "/testffileinHDFS.txt");
dfs.copyFromLocalFile(src, dst);

Now there is only one copy of the file in HDFS.
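
As an alternative (not part of the original fix), the replication of a file that has already been written can be changed afterwards with FileSystem.setReplication(); a minimal sketch, assuming the same cluster URI and path as above:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

Configuration config = new Configuration();
config.set("fs.defaultFS", "hdfs://mycluster:8020");
FileSystem dfs = FileSystem.get(config);
// Ask the NameNode to change the replication of the existing file to 2
dfs.setReplication(new Path(dfs.getWorkingDirectory() + "/testffileinHDFS.txt"), (short) 2);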