1 vote

I have a question regarding HBase databases. We access the data first by row key, then by column family, and finally by column qualifier.

My question is: will HBase store all column families with the same row key together on one node or not?

UPDATE: As an example, I want to multiply val1 and val2 in a MapReduce job, where val1 and val2 are stored in the database like this: Row=00000, Column Family M, qualifier m000001_1234567 = val1; Row=00000, Column Family R, qualifier r000001_1234567 = val2. Can I make sure that I have access to both val1 and val2 on the same node that runs the map task?
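
To make the intent concrete, here is a rough sketch of the kind of mapper I have in mind (the class name is a placeholder; families M and R are from the layout above, and I am assuming one qualifier per family per row and string-encoded numbers):

    import java.io.IOException;
    import java.util.NavigableMap;

    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.Text;

    // Each map() call receives one row (Result) containing every requested
    // column family, so val1 (family M) and val2 (family R) arrive together.
    public class MultiplyMapper extends TableMapper<Text, DoubleWritable> {

        @Override
        protected void map(ImmutableBytesWritable rowKey, Result row, Context context)
                throws IOException, InterruptedException {
            NavigableMap<byte[], byte[]> mFamily = row.getFamilyMap(Bytes.toBytes("M"));
            NavigableMap<byte[], byte[]> rFamily = row.getFamilyMap(Bytes.toBytes("R"));
            if (mFamily == null || mFamily.isEmpty() || rFamily == null || rFamily.isEmpty()) {
                return; // the row is missing one of the two values
            }
            // Assumes a single qualifier per family, as in the example row 00000,
            // and values stored as string-encoded numbers.
            double val1 = Double.parseDouble(Bytes.toString(mFamily.firstEntry().getValue()));
            double val2 = Double.parseDouble(Bytes.toString(rFamily.firstEntry().getValue()));
            context.write(new Text(Bytes.toString(rowKey.copyBytes())),
                    new DoubleWritable(val1 * val2));
        }
    }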


2 Answers

2 votes

As you might be aware, it's actually the HFile that holds the key-value data, and the HFiles are distributed across the DataNodes. ZooKeeper, the HLog (WAL) and the MemStore help in locating the row key's data and retrieving it.

The key-value data is grouped and stored per node: say keys [A-L] go to one node and the rest [M-Z] to another, considering a two-node scenario.
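
If you want to check which node serves a particular row key, a minimal sketch using the client-side RegionLocator could look like this (my_table and the row keys are just placeholders):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HRegionLocation;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.RegionLocator;
    import org.apache.hadoop.hbase.util.Bytes;

    public class WhereIsMyRow {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection connection = ConnectionFactory.createConnection(conf);
                 RegionLocator locator = connection.getRegionLocator(TableName.valueOf("my_table"))) {
                // Every row key falls into exactly one region, and that region
                // (with all of its column families) is served by a single Region Server.
                for (String rowKey : new String[] {"00000", "A0001", "M0001"}) {
                    HRegionLocation location = locator.getRegionLocation(Bytes.toBytes(rowKey));
                    System.out.println(rowKey + " -> " + location.getHostname());
                }
            }
        }
    }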


1 vote

Question 1: Will HBase store all column families with the same row key together in one node?

Yes, but there are a few special cases.

The recommended way to set up an HBase cluster is the collocated (or co-located) configuration: use the same machines for HDFS Data Nodes and HBase Region Servers (in contrast to dedicating the machines to just one of these roles, in which case all reads would be remote and performance would suffer). In such a setup, when a Region Server saves data to HDFS, the first replica of the data always gets written to the local disk. However, the placement of any further replicas is not consistent: different parts may be placed on different nodes. This means that if a machine dies, no data will be lost, but the data of that region will no longer be found on any single machine; it will be scattered all around the cluster instead. Even in this case, a single row will probably still be stored on a single Data Node, but it won't be local to the new Region Server any more.
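
To see where HDFS actually placed the replicas, a minimal sketch that lists the Data Nodes holding each block of one HFile (the path is only illustrative; the real layout is /hbase/data/<namespace>/<table>/<region>/<cf>/<hfile>):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HFileBlockLocations {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            try (FileSystem fs = FileSystem.get(conf)) {
                // Placeholder path; point it at a real HFile of your table.
                Path hfile = new Path("/hbase/data/default/my_table/REGION/M/HFILE");
                FileStatus status = fs.getFileStatus(hfile);
                for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
                    // For freshly written files the first replica is on the Region Server's
                    // local Data Node; after failures or re-replication it can be anywhere.
                    System.out.println("offset " + block.getOffset() + " -> "
                            + String.join(", ", block.getHosts()));
                }
            }
        }
    }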

This is not the only way data locality can get lost; previously, even restarting HBase had this effect. A lot of older posts mention this, but it has since been fixed in HBASE-2896.

Even if data locality gets lost, the next major compaction will restore it.
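
If you don't want to wait for the scheduled compaction, a minimal sketch of requesting one through the Admin API (the table name is a placeholder; the call only queues the compaction and returns immediately):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;

    public class CompactTable {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection connection = ConnectionFactory.createConnection(conf);
                 Admin admin = connection.getAdmin()) {
                // Rewriting the HFiles makes each Region Server write new files whose
                // first replica lands on its local Data Node again, restoring locality.
                admin.majorCompact(TableName.valueOf("my_table"));
            }
        }
    }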


Question 2: When reading an HBase table from a MapReduce job, does each mapper run on the node where the data it uses is stored?

My understanding is that apart from the special case mentioned above, the answer is yes, but I couldn't find this explicitly mentioned anywhere.
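
For illustration, a rough sketch of a driver for the multiplication job from the question (my_table is a placeholder and MultiplyMapper stands for whatever mapper you use): TableInputFormat creates one input split per region and records that region's server as the split's preferred location, so the scheduler tries to run each mapper on the node that serves that row range.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class MultiplyJob {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            Job job = Job.getInstance(conf, "multiply val1 and val2");
            job.setJarByClass(MultiplyJob.class);

            Scan scan = new Scan();
            scan.addFamily(Bytes.toBytes("M"));   // val1
            scan.addFamily(Bytes.toBytes("R"));   // val2
            scan.setCaching(500);                 // batch rows per RPC
            scan.setCacheBlocks(false);           // recommended for MapReduce scans

            // Sets TableInputFormat and the mapper; one split (and mapper) per region.
            TableMapReduceUtil.initTableMapperJob(
                    "my_table",            // placeholder table name
                    scan,
                    MultiplyMapper.class,  // e.g. the mapper sketched in the question
                    Text.class,
                    DoubleWritable.class,
                    job);

            job.setNumReduceTasks(0);             // map-only: just emit the products
            FileOutputFormat.setOutputPath(job, new Path(args[0]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }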
