Can't download or read Hive output in Amazon S3 bucket

Question

I'm new to AWS and Hive, and I'm trying to use Hive to analyze Google Ngrams data. I tried to save a table as tab-delimited CSV in an S3 bucket, but now I don't know how to view it or download it to see if my job executed correctly.

The query I used to create the table was

CREATE EXTERNAL TABLE test_table2 (
 gram string,
 year int,
 occurrences bigint,
 pages bigint,
 books bigint
 )
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION 's3://mybucket/sub-bucket/test-table2.txt';

I then filled the table with data:

INSERT OVERWRITE TABLE test_table2
SELECT
 gram,
 year,
 occurrences,
 pages,
 books
FROM
 eng1m_5grams_normed
WHERE
 gram = 'early bird gets the worm';

The query ran fine, and I think everything worked correctly. However, when I navigate to my bucket in the S3 Management Console, the text file appears as a folder containing a bunch of files. These files have long hexadecimal names and are 0 bytes in size.

Is this just the text file represented as a directory? Is there a way I can view or download the file to see if my query worked? I tried to make the directory public so I could download it, but the download button in the "Actions" dropdown menu is still greyed out.

kgu87 · Accepted Answer · 2013-05-25T00:11:05

In Hive on S3, think of an S3 directory as a table. The path you give in LOCATION is treated as a directory even if it ends in ".txt", and the files in that directory hold the table's contents (i.e. its rows). The directory contains multiple files because multiple tasks (mappers or reducers) write the "table", each producing its own output file.

S3 Browser is a very nice tool for working with S3.
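Once the folder has been downloaded to your machine (with S3 Browser or any other client), a quick way to check the result is to read every part file and count the rows. The following is only a minimal sketch: the function name `read_hive_output` and the local directory path are hypothetical, and it assumes the files are plain tab-delimited text as declared in the CREATE TABLE statement. If it reports 0 rows, every part file was empty, which would mean the SELECT matched nothing.

```python
import csv
import sys
from pathlib import Path


def read_hive_output(directory):
    """Return all rows found in the tab-delimited part files in `directory`.

    Hypothetical helper: assumes the S3 "folder" was already copied locally.
    """
    rows = []
    # Sort by name so the rows come back in a stable order.
    for part in sorted(Path(directory).iterdir()):
        # Skip subdirectories and zero-byte files (empty task outputs or markers).
        if not part.is_file() or part.stat().st_size == 0:
            continue
        with part.open(newline="", encoding="utf-8") as handle:
            reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
            rows.extend(reader)
    return rows


if __name__ == "__main__":
    # Usage (path is an assumption): python check_output.py ./test-table2.txt
    result = read_hive_output(sys.argv[1])
    print(f"{len(result)} rows")
    for row in result[:10]:
        print(row)
```

Each row comes back as a list of strings in the table's column order (gram, year, occurrences, pages, books), so the numeric fields still need converting if you want to do arithmetic on them.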
