0 votes

Is there any way to list files from Hadoop HDFS and store only the file names locally?

Example:

I have a file india_20210517_20210523.csv. I'm currently copying files from HDFS to local using the copyToLocal command, but copying them is time-consuming because the files are huge. All I need is the file names stored in a .txt file, so I can perform cut operations on them with a bash script.

Kindly help me.

2
You can redirect the output of the hdfs list command, for example hdfs dfs -ls -C hdfs/path/you/want/files/from > file_list.out - mazaneicha

2 Answers

1 vote

The easiest way to do this is with the command below.

hdfs dfs -ls /path/fileNames | awk '{print $8}' | xargs -n 1 basename > Output.txt

How it works:

hdfs dfs -ls : lists all the information about the given path
awk '{print $8}' : prints the 8th column of that output, which is the full file path
xargs -n 1 basename : strips the directory part of each path, leaving only the file name
> Output.txt : redirects the file names into a text file
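
As an illustration (the /data directory, owner/group, size, and timestamp below are made up; the file name is the one from the question), a run could look like this:

$ hdfs dfs -ls /data
Found 1 items
-rw-r--r--   1 user group    52428800 2021-05-24 10:00 /data/india_20210517_20210523.csv

$ hdfs dfs -ls /data | awk '{print $8}' | xargs -n 1 basename > Output.txt
$ cat Output.txt
india_20210517_20210523.csv

Note that the "Found 1 items" header line produces no entry in Output.txt: its 8th column is empty, and xargs passes no argument to basename for an empty line.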

Hope this answers your question.

0 votes

If you want to do this programmatically, you can use the FileSystem and FileStatus objects of the Hadoop API to:

  1. list the contents of your (current or another) target directory,
  2. check whether each entry of that directory is a file or another directory, and
  3. write the name of each file as a new line to a file stored locally.

The code for this type of application can look like this:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;

import java.io.File;
import java.io.PrintWriter;


public class Dir_ls
{
    public static void main(String[] args) throws Exception 
    {
        // get input directory as a command-line argument
        Path inputDir = new Path(args[0]);  

        Configuration conf = new Configuration();

        FileSystem fs = FileSystem.get(conf);

        if(fs.exists(inputDir))
        {
            // list directory's contents
            FileStatus[] fileList = fs.listStatus(inputDir);

            // create file and its writer
            PrintWriter pw = new PrintWriter(new File("output.txt"));

            // scan each record of the contents of the input directory
            for(FileStatus file : fileList)
            {
                if(!file.isDirectory()) // only take into account files
                {
                    System.out.println(file.getPath().getName());
                    pw.write(file.getPath().getName() + "\n");
                }
            }

            pw.close();
        }
        else
            System.out.println("Directory named \"" + args[0] + "\" doesn't exist.");
    }
}
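
One possible way to compile and run it (a sketch: the jar name and the /data input path are illustrative, and compiling via com.sun.tools.javac.Main assumes a JDK 8-style setup as in the Hadoop MapReduce tutorial):

export HADOOP_CLASSPATH=${JAVA_HOME}/lib/tools.jar    # let the hadoop wrapper find the Java compiler
hadoop com.sun.tools.javac.Main Dir_ls.java           # compile against the Hadoop classpath
jar cf dir_ls.jar Dir_ls*.class                       # package the compiled classes (illustrative jar name)
hadoop jar dir_ls.jar Dir_ls /data                    # run, listing file names under /data (illustrative path)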

So if we want to list the files from the root (.) directory of HDFS, and the contents under it include both directories and text files:

[screenshot: contents of the HDFS root directory]

This will be the command-line output of the application:

[screenshot: the application's console output, listing only the file names]

And this is what's written inside the output.txt text file stored locally:

[screenshot: output.txt containing the same file names]
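
As a purely illustrative example (these names are made up): if the root directory contained the directories dir0 and dir1 and the files file0.txt and file1.txt, the two directories would be skipped, and both the console output and the local output.txt would contain:

file0.txt
file1.txt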