
I have a Hadoop cluster in Cloudera with 4 nodes (1 master and 3 slaves) and a replication factor of 3. Within a day, my cluster keeps getting bigger for no apparent reason: I don't execute any jobs, yet the space left on the devices shrinks within a few minutes. I have removed some files and changed a few settings. Below are the logs from my Hadoop master and datanodes.
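For reference, these are the kinds of commands that can show what is consuming the space (a sketch; adjust paths to your layout):

# Usage of each top-level HDFS directory, human-readable
hdfs dfs -du -h /

# File system health report; also flags over-replicated or corrupt blocks
hdfs fsck /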

A portion of the log files:

Hadoop Master Node

2015-07-17 09:30:49,637 INFO FSNamesystem.audit: allowed=true        ugi=hdfs (auth:SIMPLE)        ip=/172.20.1.45        cmd=listCachePools        src=null        dst=null        perm=null        proto=rpc
2015-07-17 09:30:49,649 INFO FSNamesystem.audit: allowed=true        ugi=hdfs (auth:SIMPLE)        ip=/172.20.1.45        cmd=create        src=/tmp/.cloudera_health_monitoring_canary_files/.canary_file_2015_07_17-09_30_49        dst=null        perm=hdfs:supergroup:rw-rw-rw-        proto=rpc
2015-07-17 09:30:49,684 INFO FSNamesystem.audit: allowed=true        ugi=hdfs (auth:SIMPLE)        ip=/172.20.1.45        cmd=open        src=/tmp/.cloudera_health_monitoring_canary_files/.canary_file_2015_07_17-09_30_49        dst=null        perm=null        proto=rpc
2015-07-17 09:30:49,699 INFO FSNamesystem.audit: allowed=true        ugi=hdfs (auth:SIMPLE)        ip=/172.20.1.45        cmd=delete        src=/tmp/.cloudera_health_monitoring_canary_files/.canary_file_2015_07_17-09_30_49        dst=null        perm=null        proto=rpc
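For context: the create/open/delete sequence on .canary_file_* above is Cloudera Manager's health-monitoring canary, which periodically writes, reads, and deletes a tiny file under /tmp/.cloudera_health_monitoring_canary_files, so these audit entries are expected. A quick way to confirm the pattern, assuming the default CDH audit log location:

# Tally the operations hitting the canary files (log path is an assumption)
grep canary_file /var/log/hadoop-hdfs/hdfs-audit.log | grep -o 'cmd=[a-zA-Z]*' | sort | uniq -c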

Hadoop Data Node

2015-07-17 09:30:49,663 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving BP-634864778-172.20.1.45-1399358938139:blk_1074658739_919097 src: /172.20.1.48:59941 dest: /172.20.1.46:50010
2015-07-17 09:30:49,669 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /172.20.1.48:59941, dest: /172.20.1.46:50010, bytes: 56, op: HDFS_WRITE, cliID: DFSClient_NONMAPREDUCE_-824197314_132, offset: 0, srvID: aa5e5f0e-4198-4df5-8dfa-6e7c57e6307d, blockid: BP-634864778-172.20.1.45-1399358938139:blk_1074658739_919097, duration: 4771606
2015-07-17 09:30:49,669 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: BP-634864778-172.20.1.45-1399358938139:blk_1074658739_919097, type=LAST_IN_PIPELINE, downstreams=0:[] terminating
2015-07-17 09:30:51,406 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService: Scheduling blk_1074658739_919097 file /dfs/dn/current/BP-634864778-172.20.1.45-1399358938139/current/finalized/subdir13/subdir253/blk_1074658739 for deletion
2015-07-17 09:30:51,407 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService: Deleted BP-634864778-172.20.1.45-1399358938139 blk_1074658739_919097 file /dfs/dn/current/BP-634864778-172.20.1.45-1399358938139/current/finalized/subdir13/subdir253/blk_1074658739
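Note that the 56-byte block above is received and then deleted two seconds later, which matches the canary lifecycle, so this traffic by itself should not fill a disk. Comparing the datanode's local footprint against the rest of the partition can show whether the growth is even inside HDFS (a sketch, using the data directory from the log):

# Space taken by the datanode's block storage
du -sh /dfs/dn

# Anything else living on the same partition
du -sh /dfs/*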

pl.FsDatasetAsyncDiskService: Deleted BP-634864778-172.20.1.45-1399358938139 blk_1074658740_919098 file /dfs/dn/current/BP-634864778-172.20.1.45-1399358938139/current/finalized/subdir13/subdir253/blk_1074658740
2015-07-17 09:32:54,684 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving BP-634864778-172.20.1.45-1399358938139:blk_1074658741_919099 src: /172.20.1.48:33789 dest: /172.20.1.47:50010
2015-07-17 09:32:54,725 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /172.20.1.48:33789, dest: /172.20.1.47:50010, bytes: 56, op: HDFS_WRITE, cliID: DFSClient_NONMAPREDUCE_705538126_132, offset: 0, srvID: bff71ff1-db18-438a-b2ba-4731fa36d44e, blockid: BP-634864778-172.20.1.45-1399358938139:blk_1074658741_919099, duration: 39309294
2015-07-17 09:32:54,725 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: BP-634864778-172.20.1.45-1399358938139:blk_1074658741_919099, type=LAST_IN_PIPELINE, downstreams=0:[] terminating
2015-07-17 09:32:55,909 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: RECEIVED SIGNAL 15: SIGTERM
2015-07-17 09:32:55,911 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: SHUTDOWN_MSG: 

At this moment all of my cluster's services are stopped.
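Signal 15 (SIGTERM) means something outside the DataNode asked it to shut down, often the Cloudera Manager agent during a stop or restart, rather than a crash. The agent log around that timestamp may show what initiated it (a sketch; the path is the usual Cloudera Manager agent log location, so it may differ on your install):

# Look for stop/restart activity around 09:32 on 2015-07-17
grep '09:32' /var/log/cloudera-scm-agent/cloudera-scm-agent.log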

Do you know what could be happening? Any help would be appreciated. Thanks a lot.

Hello, what is your log level? Do you have any Flume service running? - kulssaka
Can you run the df -h command on the machine where you are seeing the issue? Please provide the output you get. - Sandeep Singh

1 Answer


I added a few datanodes to a PROD cluster running Cloudera Manager 5.4 and CDH 5.4. Each node is configured as below:

12 disks, each mounted on a different file system, with /var, /tmp, and the OS on separate disks.

As soon as I added the datanodes, each volume was immediately filled with 46.9 GB of data (almost 5% of each disk's capacity). This was before running the rebalancer.

Each disk is filled as below:

[root@data14-prod ~]# du -sh /dfs1/*
8.6G    /dfs1/dfs
16K     /dfs1/lost+found
331M    /dfs1/yarn

This usage doesn't account for the missing 46 GB of space.
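The NameNode's own accounting separates DFS-used from non-DFS-used space, which can show whether HDFS even sees those 46 GB (a sketch):

# Per-datanode capacity, DFS Used, and Non DFS Used
hdfs dfsadmin -report | grep -E 'Name:|DFS Used:|Non DFS Used:|DFS Remaining:'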

Swap space is set to 19 GB on the OS disk.

Output of df -h:

[root@data14-prod ~]# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/vg_data14prod-lv_root
                      147G   11G  129G   8% /
tmpfs                  63G   32K   63G   1% /dev/shm
/dev/sda1             477M   78M  374M  18% /boot
/dev/sdb1             917G  9.0G  861G   2% /dfs1
/dev/sdc1             917G   11G  860G   2% /dfs2
/dev/sdd1             917G  8.2G  862G   1% /dfs3
/dev/sde1             917G  9.6G  861G   2% /dfs4
/dev/sdf1             917G  8.8G  861G   2% /dfs5
/dev/sdg1             917G  8.8G  861G   2% /dfs6
/dev/sdh1             917G   11G  860G   2% /dfs7
/dev/sdi1             917G  9.0G  861G   2% /dfs8
/dev/sdj1             917G  8.2G  862G   1% /dfs9
/dev/sdk1             917G  9.2G  861G   2% /dfs10
/dev/sdl1             917G  8.4G  862G   1% /dfs11
/dev/sdm1             917G  7.5G  863G   1% /dfs12
/dev/mapper/vg_data14prod-lv_tmp
                       59G   54M   56G   1% /tmp
/dev/mapper/vg_data14prod-lv_var
                       50G  765M   46G   2% /var
cm_processes           63G  756K   63G   1% /var/run/cloudera-scm-agent/process
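
Notably, the df output itself accounts for the gap: on /dfs1, Size minus Used minus Avail is 917 - 9 - 861 = 47 GB, almost exactly the 46.9 GB in question and about 5% of the disk. That is the classic signature of the reserved blocks that ext3/ext4 sets aside for root (5% by default). A way to check, assuming the data partitions are ext3/ext4:

# Reserved block count times block size = bytes held back for root
tune2fs -l /dev/sdb1 | grep -E 'Block count|Reserved block count|Block size'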


Cloudera config:
Disk    Mount Point Usage
/dev/sdl1   /dfs11  55.7 GiB/916.3 GiB
/dev/sdk1   /dfs10  53.9 GiB/916.3 GiB
/dev/sdm1   /dfs12  54.3 GiB/916.3 GiB
/dev/mapper/vg_data08prod-lv_var    /var    3.2 GiB/49.1 GiB
/dev/mapper/vg_data08prod-lv_tmp    /tmp    3.1 GiB/58.9 GiB
/dev/sda1   /boot   102.9 MiB/476.2 MiB
/dev/sdg1   /dfs6   54.7 GiB/916.3 GiB
cm_processes    /var/run/cloudera-scm-agent/process 756.0 KiB/63.0 GiB
/dev/mapper/vg_data08prod-lv_root   /   18.1 GiB/146.2 GiB
/dev/sdj1   /dfs9   54.6 GiB/916.3 GiB
/dev/sdi1   /dfs8   53.8 GiB/916.3 GiB
/dev/sdb1   /dfs1   56.3 GiB/916.3 GiB
/dev/sdd1   /dfs3   55.2 GiB/916.3 GiB
/dev/sdc1   /dfs2   55.6 GiB/916.3 GiB
/dev/sdf1   /dfs5   55.4 GiB/916.3 GiB
/dev/sde1   /dfs4   55.0 GiB/916.3 GiB
/dev/sdh1   /dfs7   55.0 GiB/916.3 GiB
tmpfs   /dev/shm    16.0 KiB/63.0 GiB
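
For comparison with df above: Cloudera appears to report usage as capacity minus available, which silently includes those reserved blocks, while df's Used column does not. A back-of-the-envelope check with the /dfs1 numbers (an assumption about how Cloudera computes the figure):

# GiB on /dfs1: size - available vs df's Used
echo $((917 - 861))   # 56, close to the 56.3 GiB Cloudera shows; df's Used is only 9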

Any ideas where the missing 46 GB on each disk went?
This is a huge issue: combined across all 12 disks on the 16 datanodes I added, about 9 TB of disk space is unaccounted for.
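
If reserved blocks do turn out to be the cause, a common practice for dedicated HDFS data partitions is to shrink the reservation, since root never needs emergency space there (a sketch; apply only to the data disks, never to /, /var, or /tmp):

# Drop the root-reserved space from 5% to 1% on a data-only partition
tune2fs -m 1 /dev/sdb1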


  [Cloudera config]: http://i.stack.imgur.com/XQcdg.jpg