I have a design question. In my CDH 4.1.2 (Cloudera) installation, daily rolling log data is dumped into HDFS, and I need to produce reports on the success and failure rates per day.
I have two approaches in mind:
- Load the daily log data into Hive tables and run a complex aggregation query at report time (see the first sketch after this list).
- Run a MapReduce job up front every day to generate the summary (which is essentially a few lines) and keep appending it to a common file backed by a Hive table. At report time, a simple SELECT then fetches the summary (see the second sketch after this list).
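
To make approach 1 concrete, here is a rough HiveQL sketch of what I have in mind. The table, column, and path names (raw_logs, status, log_date, /data/logs) are placeholders, and I am assuming tab-delimited logs landing in one directory per day:

```sql
-- External table over the daily log dumps, one partition per day.
CREATE EXTERNAL TABLE raw_logs (
  event_time STRING,
  status     STRING   -- e.g. 'SUCCESS' or 'FAILURE'
)
PARTITIONED BY (log_date STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/logs';

-- Register each day's dump after it lands.
ALTER TABLE raw_logs ADD PARTITION (log_date = '2013-01-15')
LOCATION '/data/logs/2013-01-15';

-- Report query: success/failure rates per day over the raw data.
SELECT
  log_date,
  SUM(CASE WHEN status = 'SUCCESS' THEN 1 ELSE 0 END) / COUNT(*) AS success_rate,
  SUM(CASE WHEN status = 'FAILURE' THEN 1 ELSE 0 END) / COUNT(*) AS failure_rate
FROM raw_logs
GROUP BY log_date;
```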
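And here is approach 2, sketched with a Hive INSERT for brevity (in reality the summary would come from the daily MapReduce job writing into the table's directory; daily_summary is again a placeholder name). INSERT INTO TABLE is available from Hive 0.8 onward, so it should work on the Hive shipped with CDH 4.1.2:

```sql
-- Summary table the daily job appends to.
CREATE TABLE daily_summary (
  log_date     STRING,
  success_rate DOUBLE,
  failure_rate DOUBLE
);

-- Daily append: each run adds one new small file to the table.
INSERT INTO TABLE daily_summary
SELECT
  log_date,
  SUM(CASE WHEN status = 'SUCCESS' THEN 1 ELSE 0 END) / COUNT(*),
  SUM(CASE WHEN status = 'FAILURE' THEN 1 ELSE 0 END) / COUNT(*)
FROM raw_logs
WHERE log_date = '2013-01-15'
GROUP BY log_date;

-- The report is then a trivial scan.
SELECT * FROM daily_summary ORDER BY log_date;
```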
I am trying to understand which of the two is the better approach, or whether there is a better alternative altogether.
The second approach adds some complexity in terms of merging files: if the appended summaries are never merged, I end up with lots of very small files on HDFS, which seems like a bad idea (a possible workaround is sketched below).
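The workaround I am considering, assuming the daily_summary table sketched above, is to periodically rewrite the table in place so that the accumulated small files get replaced by the output of a single job. I have not verified this on CDH 4.1.2, so treat it as a sketch:

```sql
-- Encourage Hive to merge small output files at the end of the job.
SET hive.merge.mapfiles=true;
SET hive.merge.mapredfiles=true;

-- Rewrite the table in place; the staged output replaces the many
-- small append files in one move.
INSERT OVERWRITE TABLE daily_summary
SELECT * FROM daily_summary;
```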
Your input is appreciated.
Thanks