
I am using Pig and HBase to store some information in a database. I have a dataset, shown below via the DUMP command, which will be stored in HBase in the next stage.

DUMP somedata;

produces a chunk of data that contains duplicate rows, like below.

(rowkey, cf:1, cf:2 ....)
(12345::456::idea, 4567, deleted, 2.3, next, super)
(12345::456::idea, 4567, deleted, 2.3, next, super)
(12345::456::idea, 4567, deleted, 2.3, next, super)
(12345::456::idea, 4567, deleted, 2.3, next, super)
(12345::456::idea, 4568, deleted, 2.3, next, super)
(12345::456::idea, 4568, deleted, 2.3, next, super)
(12345::456::idea, 4568, deleted, 2.3, next, super)
(12345::456::idea, 4569, deleted, 2.3, next, super)
(12345::456::idea, 4569, deleted, 2.3, next, super)
(12345::456::idea, 4569, deleted, 2.3, next, super)

When I use the STORE command to store somedata with HBaseStorage, all the duplicate rows are eliminated and only distinct rows are stored. I am not sure whether that is the expected behaviour.

Of the rows above, it stores only:

(12345::456::idea, 4567, deleted, 2.3, next, super)
(12345::456::idea, 4568, deleted, 2.3, next, super)
(12345::456::idea, 4569, deleted, 2.3, next, super)

And sometimes it even fails to store some rows.

Can anyone clarify this?

Can you please add your code and the HBase table definition? (54l3d)

1 Answer


This is how HBase is designed. It stores values per row key and family:column name, so a write to a row key that already exists updates that row instead of adding a second one. If you set a key for HBase and 4 records arrive with the same key, it will ultimately store only one record. For example:

ID,NAME,AGE

1,SAM,20
2,RAJ,25
1,ANN,27

If ID is set as the key, then HBase will only have:

1,ANN,27
2,RAJ,25

Next, if you insert some more data:

ID,HOMETOWN
1,Bangalore
5,Jaipur

HBase will have:

1,ANN,27,Bangalore
2,RAJ,25
5,Jaipur
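
The overwrite-and-merge behaviour in the example above can be imitated with a small Python sketch. This is only a toy model, not the HBase API: the table is assumed to be a plain dict keyed by row key, and the put helper and the cf:name, cf:age and cf:hometown column names are made up for illustration.

```python
# Toy model of the write behaviour described above (NOT the HBase API).
# A "table" is a dict keyed by row key; each row maps column -> latest value.

def put(table, row_key, columns):
    """Create the row if it is missing, otherwise overwrite/add its columns."""
    table.setdefault(row_key, {}).update(columns)

table = {}
put(table, "1", {"cf:name": "SAM", "cf:age": "20"})
put(table, "2", {"cf:name": "RAJ", "cf:age": "25"})
put(table, "1", {"cf:name": "ANN", "cf:age": "27"})  # same key: replaces SAM
put(table, "1", {"cf:hometown": "Bangalore"})         # same key: adds a column
put(table, "5", {"cf:hometown": "Jaipur"})            # new key: new row

for row_key in sorted(table):
    print(row_key, table[row_key])
```

Five writes leave three rows, because every write to key 1 lands in the same row.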

If you want to save all records, you will have to use composite keys, that is, a row key built from enough fields to make each record unique.
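
As a rough illustration of the composite-key idea, here is a Python sketch using a few rows shaped like the ones in the question. It again models the table as a dict rather than calling HBase. The composite_key helper, the "::" separator and the choice of the first column as the extra key part are assumptions for the example.

```python
# Sketch: same toy dict model as a stand-in for an HBase table (not the HBase API).
rows = [
    ("12345::456::idea", "4567", "deleted"),
    ("12345::456::idea", "4567", "deleted"),
    ("12345::456::idea", "4568", "deleted"),
    ("12345::456::idea", "4569", "deleted"),
]

def composite_key(row_key, *parts, sep="::"):
    """Hypothetical helper: append extra fields to the row key."""
    return sep.join((row_key,) + parts)

by_plain_key = {}
by_composite_key = {}
for row_key, c1, c2 in rows:
    by_plain_key[row_key] = (c1, c2)                          # later rows overwrite
    by_composite_key[composite_key(row_key, c1)] = (c1, c2)   # one row per distinct c1

print(len(by_plain_key))      # 1
print(len(by_composite_key))  # 3
```

Rows that are identical in every field used for the key still collapse into one, so keeping exact duplicates would need an extra distinguishing part in the key, such as a sequence number or timestamp.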