5
votes

I'm using the Cloudera QuickStart VM 5.1.0-1.

I'm trying to load my 3 GB CSV into Hadoop via Hue. What I have tried so far:

- Load the CSV into HDFS, into a folder called datasets located at /user/hive/datasets
- Use the Metastore Manager to load it into the default database

Everything seems to work fine, in that the table is created with the right columns. The main problem is that when I query the table in Impala with the following query:

show table stats new_table

I see that the reported size is only 64 MB instead of the actual size of the CSV, which should be about 3 GB.

Also, if I do a count(*) via Impala, the number of rows is only about 70,000 instead of the actual 7 million.
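For reference, the row-count check is run from impala-shell against the table created by the Metastore Manager (new_table, as above):

select count(*) from new_table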

Any help would be deeply appreciated.

Thanks in advance.

3
Before doing the load, is the CSV file already truncated? – Romain
Nope. So far I have found a workaround: creating the table via Impala and loading the data from there, but then I have to write the full create table statement, including all 140 columns I have, something like create table (A string, B int, etc.). This is painful to manage, which is why I was looking for a more automated solution. – bobo32
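For reference, the manual Impala workaround mentioned in the comment would look roughly like the sketch below; the column names and file name are placeholders, and the real table has about 140 columns:

-- hypothetical sketch of the manual workaround: define the schema by hand,
-- then load the CSV already sitting in HDFS (the file name is a placeholder)
CREATE TABLE new_table (
    A STRING,
    B INT
    -- ... remaining columns ...
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

LOAD DATA INPATH '/user/hive/datasets/my_data.csv' INTO TABLE new_table;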

3 Answers

6
votes

I've had the exact same problem. This is an issue with how Hue imports the file via the web interface, which has a 64MB limit.

I've been importing large datasets by using the Hive CLI and the -f flag against a text file with the DDL code.

Example:

hive -f beer_data_loader.hql



beer_data_loader.hql:

  CREATE DATABASE IF NOT EXISTS beer  
  COMMENT "Beer Advocate Database";


-- Destination table, stored as Parquet
CREATE TABLE IF NOT EXISTS beer.beeradvocate_raw(  
    beer_name           STRING,
    beer_ID             BIGINT,
    beer_brewerID       INT,
    beer_ABV            FLOAT,
    beer_style          STRING,
    review_appearance   FLOAT,
    review_aroma        FLOAT,
    review_palate       FLOAT,
    review_taste        FLOAT,
    review_overall      FLOAT,
    review_time         BIGINT,
    review_profileName  STRING,
    review_text         STRING
    )
 COMMENT "Beer Advocate Data Raw"
 ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '|'
 STORED AS parquet;


-- External staging table over the pipe-delimited source file already in HDFS
CREATE EXTERNAL TABLE IF NOT EXISTS beer.beeradvocate_temp(  
    beer_name           STRING,
    beer_ID             BIGINT,
    beer_brewerID       INT,
    beer_ABV            FLOAT,
    beer_style          STRING,
    review_appearance   FLOAT,
    review_aroma        FLOAT,
    review_palate       FLOAT,
    review_taste        FLOAT,
    review_overall      FLOAT,
    review_time         BIGINT,
    review_profileName  STRING,
    review_text         STRING
    )
 COMMENT "Beer Advocate External Loading Table"
 ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '|'
 LOCATION '/user/name/beeradvocate.data';


-- Copy the staged data into the Parquet table, then drop the staging table
INSERT OVERWRITE TABLE beer.beeradvocate_raw SELECT * FROM beer.beeradvocate_temp;  
DROP TABLE beer.beeradvocate_temp; 
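
Since the original question checked the result through Impala, one way to verify the load afterwards (from impala-shell, assuming the tables above) could be:

-- make the Hive-created table visible to Impala, then gather stats and count rows
INVALIDATE METADATA beer.beeradvocate_raw;
COMPUTE STATS beer.beeradvocate_raw;
SELECT COUNT(*) FROM beer.beeradvocate_raw;
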
2
votes

Seems like a bug in Hue. Found a workaround. The file gets truncated if you select the "Import data from file" checkbox when you create the table. Leave that unchecked to create an empty table. Then select the newly created table in the Metastore Manager and use the "Import Data" option in the Actions menu to populate it. This should populate all the rows.

0
votes

This bug (HUE-2501) occurred when importing a file with headers larger than 64 MB.

Peter's workaround is good, and the bug itself is fixed in Hue 3.8 and from CDH 5.3.2 onward.