My current system is architected as follows.

A log parser parses the raw logs every 5 minutes and writes the output to HDFS in TSV format. I created a Hive table on top of those TSV files in HDFS.

From some benchmarks, I found that Parquet can reduce space usage by 30-40%. I also found that Hive can create tables backed by Parquet files starting with Hive 0.13. I would like to know whether I can convert the TSV files to Parquet.

Any suggestions are appreciated.

1 Answer

Yes, in Hive you can easily convert from one format to another by selecting from a table stored in one format and inserting into a table stored in the other.

For example, if you have a TSV table defined as:

CREATE TABLE data_tsv
(col1 STRING, col2 INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';

And a Parquet table defined as:

CREATE TABLE data_parquet
(col1 STRING, col2 INT)
STORED AS PARQUET;

You can convert the data with:

INSERT OVERWRITE TABLE data_parquet SELECT * FROM data_tsv;

Or you can skip the separate Parquet table DDL by using CREATE TABLE ... AS SELECT:

CREATE TABLE data_parquet STORED AS PARQUET AS SELECT * FROM data_tsv;
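
Since the parser in the question writes a new TSV batch every 5 minutes, a recurring conversion would probably append rather than overwrite. Below is a minimal sketch of that idea, not a definitive setup. It assumes data_parquet already exists as defined above; the table name data_tsv_staging and the path /data/logs/tsv_incoming are hypothetical, and it assumes the staging directory holds only the batch that has not been converted yet (otherwise rows would be duplicated on each run).

```sql
-- Hypothetical external table over the directory the log parser writes to.
-- The name and the LOCATION path are assumptions, not from the question.
CREATE EXTERNAL TABLE IF NOT EXISTS data_tsv_staging
(col1 STRING, col2 INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/data/logs/tsv_incoming';

-- Append the new batch to the Parquet table.
-- INSERT INTO keeps existing rows; INSERT OVERWRITE would replace them.
INSERT INTO TABLE data_parquet
SELECT * FROM data_tsv_staging;

-- Sanity check: the Parquet row count should have grown by the
-- number of rows in the staging batch.
SELECT COUNT(*) FROM data_tsv_staging;
SELECT COUNT(*) FROM data_parquet;
```

Because the staging table is external, dropping it leaves the TSV files in HDFS untouched, so clearing or archiving converted batches remains the job of whatever schedules this script.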