0 votes

What is an efficient way to export data from a Hive/Impala table, with conditions, into a file? The data would be huge, close to 10 GB. The Hive table is stored as Parquet with Snappy compression, and the exported file should be CSV.

The table is partitioned daily and the data needs to be extracted on a daily basis. I would like to know whether

1) the Impala approach

impala-shell -k -i servername:portname -B -q 'select * from table where year_month_date=$$$$$$$$' -o filename '--output_delimiter=\001'

2) the Hive approach

INSERT OVERWRITE DIRECTORY '/path' SELECT * FROM table WHERE year_month_date=$$$$$$$$

would be more efficient.
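For example, a daily run of the Impala approach would look roughly like this (the server, port, output path, and date format are placeholders; the delimiter here is ',' since the target file is CSV):

DT=$(date -d "yesterday" +%Y%m%d)
impala-shell -k -i servername:portname -B \
  -q "select * from table where year_month_date=${DT}" \
  --output_delimiter=',' -o /local/path/table_${DT}.csv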

Comment: Add details: 1. the file format of the Hive table, 2. the file format of the exported file. – Dev

2 Answers

1 vote

Please try using dynamic partitioning for your Hive/Impala table to export the data conditionally in an efficient way.

Partition the table on the columns you are interested in and that your queries filter on, for best results.

Step 1: Create a Temporary Hive Table TmpTable and load your raw data into it
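A minimal sketch of that temporary table, assuming the raw data arrives as comma-delimited text in HDFS (the path and the columns simply mirror the employee example used below):

CREATE TABLE TmpTable (
 emp_id int,
 emp_name string,
 location string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

LOAD DATA INPATH '/path/to/raw/data' INTO TABLE TmpTable;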

Step 2: Set hive parameters to support Dynamic partition

SET hive.exec.dynamic.partition.mode=non-strict;
SET hive.exec.dynamic.partition=true;

Step 3: Create your main Hive table with partition columns, for example:

CREATE TABLE employee (
 emp_id int,
 emp_name string
)
PARTITIONED BY (location string)
STORED AS PARQUET;

Step 4: Load data from the temporary table into your employee table (main table)

INSERT OVERWRITE TABLE employee PARTITION (location)
SELECT emp_id, emp_name, location FROM TmpTable;

Step 5: Export the data from Hive with a condition

INSERT OVERWRITE DIRECTORY '/path/to/output/dir' SELECT * FROM employee WHERE location='CALIFORNIA';
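If the exported file must be comma-delimited, as the question asks, the same export can also specify a row format (a sketch assuming Hive 0.11 or later, not part of the original command above):

INSERT OVERWRITE DIRECTORY '/path/to/output/dir'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
SELECT * FROM employee WHERE location='CALIFORNIA';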

Please refer to this link:

Dynamic Partition Concept

Hope this is useful.

3 votes

Assuming tbl is your Hive Parquet table and condition is your filter condition.

CTAS command:

CREATE TABLE tbl_text
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/tmp/data'
AS SELECT * FROM tbl WHERE condition;

You will find your CSV text file (delimited by ',') at /tmp/data in HDFS.
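To quickly verify the export (the part-file names under /tmp/data are whatever Hive generates):

hadoop fs -ls /tmp/data
hadoop fs -cat /tmp/data/* | head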

You can copy this file to your local file system, if needed, using:

hadoop fs -get /tmp/data
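Since the query may produce several part files under /tmp/data, they can also be merged into a single local CSV file (the local file name here is arbitrary):

hadoop fs -getmerge /tmp/data /tmp/tbl_text.csv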