
I was trying to merge 80 GB of files in a cluster using hadoop fs -getmerge.

However, since -getmerge copies the files from HDFS to the local file system, I have to copy them to local disk first and then use -copyFromLocal to put the result back into HDFS:

hadoop fs -getmerge hdfs:///path_in_hdfs/* ./local_path

hadoop fs -copyFromLocal ./local_path hdfs://Destination_hdfs_Path/

My problem is that the local disk on the datanode is smaller than 80 GB.

Is there an alternative to -getmerge where the merge happens directly from HDFS to HDFS?

I also tried hadoop fs -cat, but it did not work.


3 Answers

2 votes

The HDFS -cat command should work. Pipe the output of -cat into -put; the trailing dash tells -put to read from stdin, so the data streams through the client without being written to a local file:

hadoop fs -cat hdfs://input_hdfs_path/* | hadoop fs -put - hdfs://output_hdfs_path/output_file.txt

0 votes

Actually, there is no real alternative. You can achieve the same result with a MapReduce or Spark job (setting the output parallelism to 1), but there is no solution using pure HDFS shell commands.
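As a sketch of the Spark approach: the snippet below (paths are placeholders, and it assumes spark-shell is available on the cluster) reads all input files and coalesces them into a single partition, so the output directory contains one merged file.

```shell
# Hypothetical paths; assumes Spark is installed and can reach HDFS.
# coalesce(1) forces a single output partition, i.e. one merged part file.
echo 'spark.read.text("hdfs:///path_in_hdfs/*")
        .coalesce(1)
        .write.text("hdfs:///merged_output")' | spark-shell
```

Note that coalescing to one partition funnels all 80 GB through a single task, so the merge is not parallel; it does, however, stay entirely inside HDFS.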

0 votes

Hadoop Streaming may help. However, the merged file will be in sorted order (the text before the first tab is treated as the key). If sorting is not desirable, then streaming is not an option.

File 1

Tom     25
Pete    30
Kevin   26

File 2

Neil    28
Chris   31
Joe     27

Merged File

Chris   31
Joe     27
Kevin   26
Neil    28
Pete    30
Tom     25
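A streaming job producing the merged (sorted) output above might look like the following sketch; the paths are placeholders and the streaming jar location varies by Hadoop distribution.

```shell
# Hypothetical paths; adjust the streaming jar path for your distribution.
# /bin/cat as mapper and reducer passes records through unchanged; the
# shuffle sorts them by key (text before the first tab), and a single
# reducer (-D mapreduce.job.reduces=1) yields one merged output file.
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
  -D mapreduce.job.reduces=1 \
  -input hdfs:///path_in_hdfs \
  -output hdfs:///merged_output \
  -mapper /bin/cat \
  -reducer /bin/cat
```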