importing compressed (lzo) data from s3 to hive

Question

I export my DynamoDB tables to s3 as a means of backup (via EMR). When I export, I store the data as lzo compressed file. My hive query is below, but essentially I followed the "To export an Amazon DynamoDB table to an Amazon S3 bucket using data compression" at http://docs.amazonwebservices.com/amazondynamodb/latest/developerguide/EMR_Hive_Commands.html

I now want to do the reverse - take my LZO file(s) and get them back into a hive table. How do you do this? I was expecting to see some hive configuration property for input, but there is not. I've googled and found some hints, but nothing definitive and nothing that works.

Files in s3 are in the format: s3://[mybucket]/backup/year=2012/month=08/day=01/000000.lzo

Here is my HQL that does the export:

SET dynamodb.throughput.read.percent=1.0;
SET hive.exec.compress.output=true;
SET io.seqfile.compression.type=BLOCK;
SET mapred.output.compression.codec = com.hadoop.compression.lzo.LzopCodec;      

CREATE EXTERNAL TABLE hiveSBackup (id bigint, periodStart string, allotted bigint, remaining bigint, created string, seconds bigint, served bigint, modified string)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler' 
TBLPROPERTIES ("dynamodb.table.name" = "${DYNAMOTABLENAME}", 
"dynamodb.column.mapping" = "id:id,periodStart:periodStart,allotted:allotted,remaining:remaining,created:created,seconds:seconds,served:served,modified:modified");

CREATE EXTERNAL TABLE s3_export (id bigint, periodStart string, allotted bigint, remaining bigint, created string, seconds bigint, served bigint, modified string)
 PARTITIONED BY (year string, month string, day string)
 ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
 LOCATION 's3://<mybucket>/backup';

INSERT OVERWRITE TABLE s3_export
 PARTITION (year="${PARTITIONYEAR}", month="${PARTITIONMONTH}", day="${PARTITIONDAY}")
 SELECT * from hiveSBackup;

Any ideas how to get it from s3, decompress, and into hive table??

Tim Tim · Accepted Answer · 2012-09-13T18:13:50

Hive on EMR can natively read data directly from S3, you don't need to import anything. You just have to create an external table and tell it where the data is. It also has lzo support setup. Hive will automatically decompress using lzo if the files end with a .lzo extension.

So to "import" your lzo data in s3 into hive, you just create an external table pointing to your lzo compressed data s3, and hive will decompress it whenever it runs a query over it. Pretty much exactly what you did when you "exported" data. That s3_export table, you can also read from.

If you want to import it to a non-external table, just insert overwrite into a new table and select from the external table.

Unless I misunderstood your question and you meant to ask about importing to dynamo, not just a hive table?

This is what I've been doing
SET hive.exec.compress.output=true; 
SET io.seqfile.compression.type=BLOCK;
SET mapred.output.compression.codec = com.hadoop.compression.lzo.LzopCodec;

CREATE EXTERNAL TABLE users
(id int, username string, firstname string, surname string, email string, birth_date string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION 's3://bucket/someusers';

INSERT OVERWRITE TABLE users
SELECT * FROM someothertable;

I end up with a bunch of files under s3://bucket/someusers with .lzo extensions which are readable by hive.

You only need to SET the codec when trying to write compressed data, reading it the compression is detected automatically.

importing compressed (lzo) data from s3 to hive

1 Answers