I have a file stored on HDFS and I need to get its size. I ran the following command at the prompt to get the file size:

hadoop fs -du -s train.csv | awk '{s+=$1} END {print s}'

I know that Hadoop stores multiple copies of each file, as determined by the replication factor. So when I run the command above, is the returned size the file size times the replication factor, or just the file size?

1 Answer

From the Hadoop documentation:

The du returns three columns with the following format:

size disk_space_consumed_with_all_replicas full_path_name

Source: https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FileSystemShell.html

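As you can see, the first column is the size of the file itself, while the second column is the disk space consumed including all replicas. Since your awk command sums $1, it returns just the file size, not the file size times the replication factor.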
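
To make the difference between the two columns concrete, here is a minimal sketch that runs awk over a sample line in the three-column format quoted from the documentation. The sample line is an assumption: the numbers and the path /user/me/train.csv are made up for illustration and correspond to a hypothetical replication factor of 3.

```shell
# Hypothetical du output line (values are illustrative, not real cluster output):
# size  disk_space_consumed_with_all_replicas  full_path_name
sample='1024  3072  /user/me/train.csv'

# Column 1: size of the file itself
size=$(printf '%s\n' "$sample" | awk '{s+=$1} END {print s}')

# Column 2: disk space consumed by all replicas
consumed=$(printf '%s\n' "$sample" | awk '{s+=$2} END {print s}')

echo "size=$size consumed=$consumed"
```

On a real cluster you would replace the printf with hadoop fs -du -s train.csv, so summing $1 gives the file size and summing $2 gives the space used by all replicas.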