5
votes

I'm using the Cloudera QuickStart VM 5.1.0-1.

I'm trying to load my 3 GB CSV into Hadoop via Hue. What I have tried so far:

- Load the CSV into HDFS, into a folder called datasets located at /user/hive/datasets
- Use the Metastore Manager to load it into the default database

Everything seems to work fine, in that the table is created with the right columns. The main problem is that when I query the table in Impala with the following query:

show table stats new_table

I see that the reported size is only 64 MB instead of the actual size of the CSV, which should be about 3 GB.

Also, if I do a count(*) via Impala, the number of rows is only about 70,000 instead of the actual 7 million.
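For reference, the row-count check is run from impala-shell against the table created by the Metastore Manager (new_table, as above):

select count(*) from new_table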

Any help would be deeply appreciated.

Thanks in advance.

3
Before doing the load, is the CSV file already truncated? – Romain
Nope. So far I have found a workaround: creating the table via Impala and loading the data from there, but then I have to write the full create table statement, including all 140 columns I have, something like create table (A string, B int, etc.). This is painful to manage, which is why I was looking for a more automated solution. – bobo32
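For reference, the manual Impala workaround mentioned in the comment would look roughly like the sketch below; the column names and file name are placeholders, and the real table has about 140 columns:

-- hypothetical sketch of the manual workaround: define the schema by hand,
-- then load the CSV already sitting in HDFS (the file name is a placeholder)
CREATE TABLE new_table (
    A STRING,
    B INT
    -- ... remaining columns ...
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

LOAD DATA INPATH '/user/hive/datasets/my_data.csv' INTO TABLE new_table;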

3 Answers

6
votes

I've had the exact same problem. This is an issue with how Hue imports the file via the web interface, which has a 64MB limit.

I've been importing large datasets by using the Hive CLI and the -f flag against a text file with the DDL code.

Example:

hive -f beer_data_loader.hql



beer_data_loader.hql:

  CREATE DATABASE IF NOT EXISTS beer  
  COMMENT "Beer Advocate Database";


-- Destination table, stored as Parquet
CREATE TABLE IF NOT EXISTS beer.beeradvocate_raw(  
    beer_name           STRING,
    beer_ID             BIGINT,
    beer_brewerID       INT,
    beer_ABV            FLOAT,
    beer_style          STRING,
    review_appearance   FLOAT,
    review_aroma        FLOAT,
    review_palate       FLOAT,
    review_taste        FLOAT,
    review_overall      FLOAT,
    review_time         BIGINT,
    review_profileName  STRING,
    review_text         STRING
    )
 COMMENT "Beer Advocate Data Raw"
 ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '|'
 STORED AS parquet;


-- External staging table over the pipe-delimited source file already in HDFS
CREATE EXTERNAL TABLE IF NOT EXISTS beer.beeradvocate_temp(  
    beer_name           STRING,
    beer_ID             BIGINT,
    beer_brewerID       INT,
    beer_ABV            FLOAT,
    beer_style          STRING,
    review_appearance   FLOAT,
    review_aroma        FLOAT,
    review_palate       FLOAT,
    review_taste        FLOAT,
    review_overall      FLOAT,
    review_time         BIGINT,
    review_profileName  STRING,
    review_text         STRING
    )
 COMMENT "Beer Advocate External Loading Table"
 ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '|'
 LOCATION '/user/name/beeradvocate.data';


-- Copy the staged data into the Parquet table, then drop the staging table
INSERT OVERWRITE TABLE beer.beeradvocate_raw SELECT * FROM beer.beeradvocate_temp;  
DROP TABLE beer.beeradvocate_temp; 
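
Since the original question checked the result through Impala, one way to verify the load afterwards (from impala-shell, assuming the tables above) could be:

-- make the Hive-created table visible to Impala, then gather stats and count rows
INVALIDATE METADATA beer.beeradvocate_raw;
COMPUTE STATS beer.beeradvocate_raw;
SELECT COUNT(*) FROM beer.beeradvocate_raw;
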
2
votes

Seems like a bug in Hue. Found a workaround. The file gets truncated if you select the "Import data from file" checkbox when you create the table. Leave that unchecked to create an empty table. Then select the newly created table in the Metastore Manager and use the "Import Data" option in the Actions menu to populate it. This should populate all the rows.

0
votes

This bug (HUE-2501) occurred when importing a file with headers larger than 64 MB.

Peter's workaround is good, and the bug itself is fixed in Hue 3.8 and from CDH 5.3.2 onward.