How does HBase know if a row contains a particular column or not? For example consider the following situation:

Let’s say we have a table with one column family named “STAT_FAM”, with the following two rows:

  • One row with key “R1”, that contains 1000 columns named S1 to S1000.
  • And another row with key “R2”, that contains another 1000 columns named from S2000 to S3000.

Now, if we are scanning the table, and the scan is defined as below:

byte[] STAT_FAM = Bytes.toBytes("STAT_FAM");  // column family from the example above

Scan s = new Scan();
s.setStartRow(Bytes.toBytes("R1"));
s.setStopRow(Bytes.toBytes("R2"));
s.addColumn(STAT_FAM, Bytes.toBytes("S500"));
s.addColumn(STAT_FAM, Bytes.toBytes("S2500"));

When HBase does the scan, it uses the row key to locate the record in a particular file on a particular region server. Once the record is located, how does it find a particular column in that row’s data?

For row “R1”, there is no column named “S2500”, so would it have to go through the complete record for this row to determine that the row does not contain the required column?

Thanks in advance!

1 Answer

Let's first understand how HBase stores its data in the HFile format.

A KeyValue in an HFile consists of:

<keylength> <valuelength> <key> <value>

Key is decomposed as:

<rowlength> <row> <columnfamilylength> <columnfamily> <columnqualifier> <timestamp> <keytype>

Together, with the key expanded into its parts, a KeyValue is laid out as:

<keylength> <valuelength> <rowlength> <row> <columnfamilylength> <columnfamily> <columnqualifier> <timestamp> <keytype> <value>
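
To make the layout concrete, here is a minimal Java sketch that writes one cell in that order and then reads the column qualifier back out. The class and method names are hypothetical, and the field widths (4-byte key and value lengths, 2-byte row length, 1-byte family length, 8-byte timestamp, 1-byte key type) are my understanding of the KeyValue class, so treat them as assumptions rather than a specification. Note that the qualifier has no length field of its own; its length is whatever remains of the key.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class KeyValueLayoutSketch {
    // Fixed-size parts of the key: rowlength(2) + familylength(1) + timestamp(8) + keytype(1).
    // These widths are assumptions for illustration.
    static final int FIXED_KEY_BYTES = 2 + 1 + 8 + 1;

    // Serializes one cell in the order shown in the layout above.
    static byte[] encode(String row, String family, String qualifier,
                         long timestamp, byte keyType, byte[] value) {
        byte[] r = row.getBytes(StandardCharsets.UTF_8);
        byte[] f = family.getBytes(StandardCharsets.UTF_8);
        byte[] q = qualifier.getBytes(StandardCharsets.UTF_8);
        int keyLength = FIXED_KEY_BYTES + r.length + f.length + q.length;
        ByteBuffer buf = ByteBuffer.allocate(4 + 4 + keyLength + value.length);
        buf.putInt(keyLength);          // <keylength>
        buf.putInt(value.length);       // <valuelength>
        buf.putShort((short) r.length); // <rowlength>
        buf.put(r);                     // <row>
        buf.put((byte) f.length);       // <columnfamilylength>
        buf.put(f);                     // <columnfamily>
        buf.put(q);                     // <columnqualifier>
        buf.putLong(timestamp);         // <timestamp>
        buf.put(keyType);               // <keytype>
        buf.put(value);                 // <value>
        return buf.array();
    }

    // Walks the same layout to recover the qualifier of a serialized cell.
    static String qualifierOf(byte[] cell) {
        ByteBuffer buf = ByteBuffer.wrap(cell);
        int keyLength = buf.getInt();
        buf.getInt();                                 // skip <valuelength>
        int rowLength = buf.getShort();
        buf.position(buf.position() + rowLength);     // skip <row>
        int familyLength = buf.get();
        buf.position(buf.position() + familyLength);  // skip <columnfamily>
        // The qualifier length is derived from the key length.
        int qualifierLength = keyLength - FIXED_KEY_BYTES - rowLength - familyLength;
        byte[] q = new byte[qualifierLength];
        buf.get(q);
        return new String(q, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] cell = encode("R1", "STAT_FAM", "S500", 1L, (byte) 4, new byte[] {42});
        System.out.println(qualifierOf(cell)); // prints S500
    }
}
```

The point of the sketch is that the row, family, and qualifier are repeated in every cell, so a reader has to inspect a cell's key to learn which column it belongs to.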

HBase is “column family oriented.” Data is stored physically into column family groups. That means all key-values for a given column family are stored together in the same set of files.

HBase provides no indices over arbitrary columns, no joins, and no multi-row transactions. If you want to query for a row based on its column value, you’d better maintain a secondary index for that, or be prepared for a full table scan.
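
The trade-off between a secondary index and a full scan can be sketched with plain in-memory maps. This is only an illustration of the idea, not how HBase or any HBase library implements it; the class, its methods, and the "qualifier=value" index key are all hypothetical choices made for this example.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

public class SecondaryIndexSketch {
    // Primary "table": row key -> (qualifier -> value), sorted by row key.
    private final Map<String, Map<String, String>> table = new TreeMap<>();
    // Hand-maintained index: "qualifier=value" -> row keys holding that value.
    private final Map<String, Set<String>> index = new HashMap<>();

    // Every write must update both the table and the index.
    void put(String row, String qualifier, String value) {
        Map<String, String> cols = table.computeIfAbsent(row, k -> new TreeMap<>());
        String old = cols.put(qualifier, value);
        if (old != null) {
            Set<String> stale = index.get(qualifier + "=" + old);
            if (stale != null) {
                stale.remove(row); // drop the entry for the overwritten value
            }
        }
        index.computeIfAbsent(qualifier + "=" + value, k -> new TreeSet<>()).add(row);
    }

    // Lookup by value using the index: one map access.
    Set<String> rowsByIndex(String qualifier, String value) {
        return index.getOrDefault(qualifier + "=" + value, Collections.emptySet());
    }

    // Lookup by value without an index: every row is visited.
    Set<String> rowsByFullScan(String qualifier, String value) {
        Set<String> out = new TreeSet<>();
        for (Map.Entry<String, Map<String, String>> e : table.entrySet()) {
            if (value.equals(e.getValue().get(qualifier))) {
                out.add(e.getKey());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        SecondaryIndexSketch t = new SecondaryIndexSketch();
        t.put("R1", "S500", "a");
        t.put("R2", "S2500", "a");
        System.out.println(t.rowsByIndex("S500", "a"));    // prints [R1]
        System.out.println(t.rowsByFullScan("S500", "a")); // prints [R1]
    }
}
```

Both lookups return the same rows; the index answers without touching the table, at the cost of extra work on every write.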

Newer versions of HBase use a more memory-efficient HFile format, but that does not remove the need to look through the rows.

HFile V2 Diff encoding

The code below shows that the column check is applied to every KeyValue it is given.

Class Name: org.apache.hadoop.hbase.filter.SingleColumnValueFilter

if (!keyValue.matchingColumn(this.columnFamily, this.columnQualifier)) {
  return ReturnCode.INCLUDE;
}

if (filterColumnValue(keyValue.getBuffer(), keyValue.getValueOffset(), keyValue.getValueLength())) {
  return this.latestVersionOnly ? ReturnCode.NEXT_ROW : ReturnCode.INCLUDE;
}

And this is the matchingColumn method from the org.apache.hadoop.hbase.KeyValue class:

  /**
   *
   * @param family column family
   * @param qualifier column qualifier
   * @return True if column matches
   */
  public boolean matchingColumn(final byte[] family, final byte[] qualifier) {
    int rl = getRowLength();
    int o = getFamilyOffset(rl);
    int fl = getFamilyLength(o);
    int ql = getQualifierLength(rl,fl);
    if (!Bytes.equals(family, 0, family.length, this.bytes, o, fl)) {
      return false;
    }
    if (qualifier == null || qualifier.length == 0) {
      if (ql == 0) {
        return true;
      }
      return false;
    }
    return Bytes.equals(qualifier, 0, qualifier.length,
        this.bytes, o + fl, ql);
  }
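
Applied to the question's example, the same comparison can be sketched without any HBase dependency. The `Cell` class and `rowHasColumn` method below are hypothetical stand-ins written for this illustration; the loop is a naive cell-by-cell check that mirrors the family-then-qualifier comparison above, not HBase's actual scan path.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ColumnMatchSketch {
    // Hypothetical stand-in for a KeyValue: only the parts needed for matching.
    static final class Cell {
        final byte[] family;
        final byte[] qualifier;
        Cell(String family, String qualifier) {
            this.family = bytes(family);
            this.qualifier = bytes(qualifier);
        }
    }

    static byte[] bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Same order of checks as matchingColumn: family first, then qualifier.
    static boolean matchingColumn(Cell cell, byte[] family, byte[] qualifier) {
        if (!Arrays.equals(cell.family, family)) {
            return false;
        }
        if (qualifier == null || qualifier.length == 0) {
            return cell.qualifier.length == 0;
        }
        return Arrays.equals(cell.qualifier, qualifier);
    }

    // Naive check: compare the wanted column against each cell of the row.
    static boolean rowHasColumn(List<Cell> rowCells, String family, String qualifier) {
        for (Cell cell : rowCells) {
            if (matchingColumn(cell, bytes(family), bytes(qualifier))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Row R1 from the question: columns S1 to S1000 in family STAT_FAM.
        List<Cell> r1 = new ArrayList<>();
        for (int i = 1; i <= 1000; i++) {
            r1.add(new Cell("STAT_FAM", "S" + i));
        }
        System.out.println(rowHasColumn(r1, "STAT_FAM", "S500"));  // prints true
        System.out.println(rowHasColumn(r1, "STAT_FAM", "S2500")); // prints false
    }
}
```

For S2500 the loop only returns false after comparing against all 1000 cells of R1, which is the cost the question asks about.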


Image credit: Cloudera-blog