
I am trying to import a few large .csv files into HBase (more than 1 TB in total). The data looks like a dump from a relational DB but has no UID column, and I do not want to import every column. I decided I need to run a custom MapReduce job first to get the data into the required format (select the columns and generate a UID) so that I can then load it with the standard hbase importtsv bulk import.

My question: Can I just create my own composite row key, say storeID:year:UID, using MapReduce and then feed it to the TSV import? Say my data looks like this:

row_key | price | quantity | item_id
A:2012:1|  0.99 |        1 |     001
A:2012:2|  0.99 |        2 |     012
B:2013:1|  0.99 |        1 |     004

From what I understand, HBase stores everything as bytes, except for timestamps. Will it understand that this is a composite key?

Any hints are appreciated!


1 Answer


I asked the same question over at Cloudera, and the answer can be found here.

Basically, the answer is yes: HBase treats row keys as opaque byte arrays, so no special separator characters are needed. I used a MapReduce job to transform the data into the following format:

A2012:1,0.99,1,012
A2012:2,0.99,2,012
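
Roughly, the transformation job is a map-only job (zero reducers) along the lines of the sketch below. The column positions, the input separator, and the UID scheme are illustrative, not the exact code; combining the task ID with a per-task counter is one way to keep generated UIDs unique across map tasks without coordination.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CompositeKeyMapper
        extends Mapper<LongWritable, Text, Text, NullWritable> {

    private int taskId;   // distinguishes map tasks
    private long seq = 0; // per-task running counter

    @Override
    protected void setup(Context context) {
        // Using the task ID as part of the UID keeps keys unique
        // across map tasks without any central coordination.
        taskId = context.getTaskAttemptID().getTaskID().getId();
    }

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Assumed input layout: storeID,year,price,quantity,item_id,...
        String[] f = line.toString().split(",");

        // Composite row key: storeID + year + ":" + UID. HBase sees this
        // as plain bytes, so any separator is purely for readability.
        String rowKey = f[0] + f[1] + ":" + taskId + "-" + (seq++);

        // Emit in the comma-separated layout shown above; with a
        // NullWritable value, TextOutputFormat writes only the key line.
        context.write(
            new Text(rowKey + "," + f[2] + "," + f[3] + "," + f[4]),
            NullWritable.get());
    }
}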

Using importtsv and completebulkload, the data was then loaded into the correct HBase regions. I had pre-split the table on the storeID prefix (A, B, C, ...).
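
For reference, the overall flow looks roughly like this; the table name, column family, and paths here are placeholders, and completebulkload corresponds to the LoadIncrementalHFiles tool:

# hbase shell: create the table pre-split on the store prefix
create 'sales', 'd', SPLITS => ['B', 'C', 'D']

# write HFiles with importtsv (note the comma separator and HBASE_ROW_KEY)
hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
  -Dimporttsv.separator=, \
  -Dimporttsv.columns=HBASE_ROW_KEY,d:price,d:quantity,d:item_id \
  -Dimporttsv.bulk.output=/tmp/hfiles \
  sales /data/transformed

# hand the HFiles to the regions (this is what completebulkload does)
hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles /tmp/hfiles sales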