I want to understand how HBase internally handles duplicate records from a file. To experiment with this, I created an EXTERNAL table in Hive with the HBase-specific configuration (table properties, SERDE, column family). I also had to create the table in HBase with the same column family, which I did.

I then performed an insert overwrite into this Hive table from a source table that contains duplicate records. By duplicate records I mean rows like these:

ID | Name        | Surname
 1 | Ritesh      | Rai
 1 | RiteshKumar | Rai

After the insert overwrite, I queried my Hive table for id 1 and got the second record as output:

 1        RiteshKumar         Rai

I want to understand how HBase decides which record wins. Does it simply write the data sequentially, so that the last record overwrites the earlier one and is treated as the latest? Or does it work some other way?

Thanks in advance.

Regards, Govind

1 Answer

You are on the right track!

The HBase data model can be seen as a 'multidimensional map' in which each cell value is associated with a timestamp (the insertion time by default):

row:column_family:column_qualifier:timestamp:value

NOTE: The timestamp is associated with each single value and not the entire row (This enables several nice features)!
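To make the per-cell timestamp concrete, here is a minimal Python sketch of that map. It is a toy model and not HBase code: the `cells` dict and the `put` and `latest` helpers are made up for illustration, and the timestamps are arbitrary small integers.

```python
# Toy model only: a plain dict standing in for an HBase table.
# Key layout mirrors row:column_family:column_qualifier:timestamp -> value.
cells = {}


def put(row, family, qualifier, timestamp, value):
    """Store one cell; each (row, family, qualifier, timestamp) is its own entry."""
    cells[(row, family, qualifier, timestamp)] = value


def latest(row, family, qualifier):
    """Return (timestamp, value) of the newest cell for one column, or None."""
    versions = [
        (ts, value)
        for (r, f, q, ts), value in cells.items()
        if (r, f, q) == (row, family, qualifier)
    ]
    return max(versions) if versions else None


put("1", "f", "name", 100, "Ritesh")
put("1", "f", "surname", 101, "Rai")
put("1", "f", "name", 102, "RiteshKumar")  # only the name column is rewritten

print(latest("1", "f", "name"))     # newest name cell
print(latest("1", "f", "surname"))  # surname keeps its own, older timestamp
```

Rewriting `f:name` adds a new cell for that column only, and the `f:surname` cell keeps the timestamp it was written with.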

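At read time you get the latest version of each cell by default, unless you ask for more. How many versions are kept is a per-column-family setting (VERSIONS): the default was 3 in older releases and has been 1 since HBase 0.96. HBase does a 'merge read' and returns the latest cell value for each column of the row. In your example both source rows map to row key 1, so whichever one is written last gets the newer timestamp and is the one returned.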
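The read behaviour can be sketched with a small in-memory model. This is an illustration under simplifying assumptions, not how HBase is implemented: `ToyColumnStore`, `max_versions`, and the `versions` argument are hypothetical names, and old versions are pruned eagerly on every put, whereas HBase only drops them physically during compaction.

```python
from collections import defaultdict


class ToyColumnStore:
    """In-memory stand-in for one column family with a version limit."""

    def __init__(self, max_versions=1):
        self.max_versions = max_versions
        # (row, column) -> {timestamp: value}
        self._cells = defaultdict(dict)

    def put(self, row, column, value, timestamp):
        versions = self._cells[(row, column)]
        versions[timestamp] = value  # same timestamp replaces that version
        # Drop the oldest versions beyond the retention limit.
        for ts in sorted(versions, reverse=True)[self.max_versions:]:
            del versions[ts]

    def get(self, row, column, versions=1):
        """Return up to `versions` values for one column, newest first."""
        stored = self._cells.get((row, column), {})
        newest_first = sorted(stored.items(), reverse=True)
        return [value for _, value in newest_first[:versions]]


store = ToyColumnStore(max_versions=3)
store.put("1", "f:name", "Ritesh", timestamp=1)
store.put("1", "f:name", "RiteshKumar", timestamp=2)

print(store.get("1", "f:name"))              # latest version only
print(store.get("1", "f:name", versions=2))  # latest two, newest first
```

With `max_versions=1` the same two puts leave only 'RiteshKumar' behind, so asking for two versions still returns a single value.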

Please try this from your hbase-shell (not really tested before posting):

put 'table_name', '1', 'f:name', 'Ritesh'
put 'table_name', '1', 'f:surname', 'Rai'
put 'table_name', '1', 'f:name', 'RiteshKumar'
put 'table_name', '1', 'f:surname', 'Rai'
put 'table_name', '1', 'f:other', 'Some other stuff'

# Data on 'disk' (that might just be the memstore for now) will look like this
# (illustrative timestamps):
# 1:f:name:1234567890:'Ritesh'
# 1:f:surname:1234567891:'Rai'
# 1:f:name:1234567892:'RiteshKumar'
# 1:f:surname:1234567893:'Rai'
# 1:f:other:1234567894:'Some other stuff'

# Now try this, and you will get 'RiteshKumar', 'Rai', 'Some other stuff':
get 'table_name', '1'

# To get the previous versions of the data as well, use the following
# (the column family must be configured to keep at least 2 versions):
get 'table_name', '1', {COLUMN => 'f', VERSIONS => 2}

Don't forget to take a look at the best practices for HBase schema design.