
I am very new to HBase. I understand that the underlying file system of HBase is HDFS.

I just wanted to understand: if, in a single cluster, I already have some data in HDFS and I import it into HBase (using Pig or Hive scripts, for example), will it create another copy of the same data in HDFS (since the underlying file system of HBase is HDFS), in the format that HBase supports (HFiles)?

Or will it create a reference to the same HDFS data?

1 Answer


Yes, it will store a copy of the imported data in HDFS (as StoreFiles/HFiles), since HBase can only operate on its own set of files. Perhaps you'll find this nice overview interesting.

You can operate directly on the data stored in HDFS, without importing it into HBase, by using an EXTERNAL Hive table:

CREATE EXTERNAL TABLE page_view(viewTime INT, userid BIGINT,
     page_url STRING, referrer_url STRING,
     ip STRING COMMENT 'IP Address of the User',
     country STRING COMMENT 'country of origination')
 COMMENT 'This is the staging page view table'
 ROW FORMAT DELIMITED FIELDS TERMINATED BY '\054'
 STORED AS TEXTFILE
 LOCATION '<hdfs_location>';
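As a sketch of how the two options play out (the table layout follows the page_view example above; the HBase-backed table name and column mapping are illustrative assumptions, and the second part requires the Hive HBase integration jars):

 -- Query the external table in place: no copy is made, Hive reads
 -- the text files at <hdfs_location> directly.
 SELECT country, COUNT(*) AS views
 FROM page_view
 GROUP BY country;

 -- Hypothetical HBase-backed Hive table; 'page_view_hbase' and the
 -- column mapping are assumptions for illustration.
 CREATE TABLE page_view_hbase(key BIGINT, country STRING)
 STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
 WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,f:country');

 -- This INSERT is the step that materializes the second copy of the
 -- data, written as HFiles under HBase's own directory in HDFS.
 INSERT OVERWRITE TABLE page_view_hbase
 SELECT userid, country FROM page_view;

So the external table only adds metadata pointing at the existing files, while writing into an HBase-backed table duplicates the data in HBase's format.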

In the Hadoop world, having multiple copies of the same data (albeit in different formats) is not usually a problem, because storage is not considered a limiting factor: it is cheap and easily scalable, since it is based on commodity hardware. In fact, if you have enough input data, it is very common for Hive/Pig/MapReduce jobs to write hundreds or even thousands of GBs of intermediate data just to complete.