I found the cause of this behavior. The split of some regions had gone wrong: those regions were permanently in transition and never completed their split, which prevented the balancer from running normally.
Look at this snippet of the balancer code in HMaster.java:
    public boolean balance() throws IOException {
      //...
      if (this.assignmentManager.getRegionStates().isRegionsInTransition()) {
        Map<String, RegionState> regionsInTransition =
          this.assignmentManager.getRegionStates().getRegionsInTransition();
        LOG.debug("Not running balancer because " + regionsInTransition.size() +
          " region(s) in transition: " + org.apache.commons.lang.StringUtils.
            abbreviate(regionsInTransition.toString(), 256));
        return false;
      }
      //...
    }
The "if" condition was always true, so this method always returned false and never reached the code below it, which actually balances the regions across the region servers.
I don't know what caused the split of those regions to fail, but when I tried to move a region from one region server to another, I found this error message in the region server log:
2018-05-17 13:11:12,695 ERROR [B.defaultRpcServer.handler=99,queue=9,port=26020] regionserver.RSRpcServices: Failed warming up region tsdb,\x00\x12\x19Z\xD2P,1525840795373.c3ebb018b9c3fc101a7b9def9100fb5f.
java.io.IOException: java.io.IOException: java.io.FileNotFoundException: File does not exist: /hbase-holmes/data/default/tsdb/32ef153360b7a9499e555a7937418ee7/t/a6cdb25689234e539ed82230ed7b790f
at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:71)
at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:61)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1828)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1799)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1712)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:588)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:365)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)
at org.apache.hadoop.hbase.regionserver.HRegion.initializeStores(HRegion.java:943)
at org.apache.hadoop.hbase.regionserver.HRegion.initializeWarmup(HRegion.java:967)
at org.apache.hadoop.hbase.regionserver.HRegion.warmupHRegion(HRegion.java:6554)
at org.apache.hadoop.hbase.regionserver.RSRpcServices.warmupRegion(RSRpcServices.java:1709)
at org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$2.callBlockingMethod(AdminProtos.java:22241)
at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2188)
at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:112)
at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:133)
at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:108)
at java.lang.Thread.run(Thread.java:745)
...
The region I wanted to move was c3ebb018b9c3fc101a7b9def9100fb5f, but the error said that the missing file belonged to region 32ef153360b7a9499e555a7937418ee7. Later I found that region c3ebb018b9c3fc101a7b9def9100fb5f is a daughter of region 32ef153360b7a9499e555a7937418ee7.
Then I checked HDFS and found that the parent region was missing, while the reference files in its daughter region, which point to the parent's store files, were still present. That is to say, the reference files in the daughter regions pointed to non-existent files.
So the region server found the reference files in the daughter regions but could not find the parent region's files, and threw this exception.
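To make the relationship between a reference file and the parent store file concrete, here is a minimal sketch. It assumes the usual HBase naming convention in which a reference file in a daughter region is named `<parent store file>.<parent encoded region name>`. The class and method names are hypothetical, and the reference file name in `main` is reconstructed from the path in the error above, not copied from my cluster.

```java
import java.util.Optional;

/** Sketch: map a daughter-region reference file back to the parent store file it needs. */
final class DanglingReferenceCheck {

    /**
     * Splits a file name of the assumed form "<parentStoreFile>.<parentEncodedRegion>".
     * Returns empty when the name has no usable dot, i.e. it looks like a plain store file.
     */
    static Optional<String[]> parseReferenceName(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot <= 0 || dot == fileName.length() - 1) {
            return Optional.empty();
        }
        return Optional.of(new String[] {
            fileName.substring(0, dot),   // parent store file name
            fileName.substring(dot + 1)   // parent encoded region name
        });
    }

    /** Builds the path of the parent store file that the reference expects to exist. */
    static Optional<String> parentStoreFile(String tableDir, String family, String referenceFileName) {
        return parseReferenceName(referenceFileName)
            .map(parts -> tableDir + "/" + parts[1] + "/" + family + "/" + parts[0]);
    }

    public static void main(String[] args) {
        // Table directory and column family come from the error above; the file name is assumed.
        String ref = "a6cdb25689234e539ed82230ed7b790f.32ef153360b7a9499e555a7937418ee7";
        System.out.println(
            parentStoreFile("/hbase-holmes/data/default/tsdb", "t", ref)
                .orElse("not a reference file"));
    }
}
```

For the assumed file name this prints the same path that appears in the FileNotFoundException. If that path does not exist in HDFS (it can be checked with `hdfs dfs -ls`), the reference is dangling.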
Finally, I removed the reference files of those splitting regions, and the balancer began to work normally. But I don't know whether any data was lost.