1
votes

I have some scripts that process my website's logs. I have loaded this data into multiple tables in Hive, and I run these scripts on a daily basis to analyze the traffic.

Lately I have noticed that the Hive queries in these scripts are taking far too long. Earlier, it took around 10-15 minutes to generate the reports, but now it takes hours to do the same.

I analyzed the data, and the dataset has grown by only around 5-10%.

One of my friends suggested that Hive is not good at joining multiple tables and that I should switch my scripts to Pig. Is Hive bad at joining tables when compared to Pig?

1
Try it, I guess. I have had a better experience with Pig than Hive. - Donald Miner

1 Answer

1
votes
Is Hive bad at joining tables

No. Hive is actually pretty good, but sometimes it takes a bit of playing around with the query optimizer.

Depending on which version of Hive you use, you may need to provide hints in your query to tell the optimizer to join the data using a certain algorithm. You can find some details about different hints here.
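To give an idea of what such a hint looks like, here is a minimal HiveQL sketch of a map-side join. The table and column names (`page_views`, `pages`, `page_id`, `section`, `dt`) and the date are hypothetical and only stand in for a large log table joined to a small lookup table. Whether the hint is honored depends on your Hive version and settings, so treat this as a starting point rather than a recipe.

```sql
-- Hypothetical tables: page_views (large log table) and pages (small lookup table).

-- Option 1: let Hive convert the join to a map join automatically
-- when the small table is below a size threshold.
SET hive.auto.convert.join=true;

-- Option 2: ask for a map-side join explicitly with a hint.
-- Newer Hive versions may ignore this hint by default
-- (see hive.ignore.mapjoin.hint); there, the setting above is what matters.
SELECT /*+ MAPJOIN(p) */
       p.section,
       COUNT(*) AS hits
FROM page_views v
JOIN pages p ON v.page_id = p.page_id
WHERE v.dt = '2013-01-01'
GROUP BY p.section;
```

Prefixing the query with `EXPLAIN` shows the plan Hive picked, which is the quickest way to check whether the join strategy actually changed.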

If you're thinking about using Pig, I don't think your choice should be motivated by performance alone. I have used both over the past few years, and in my experience there is no quantifiable gain from using Pig: in terms of performance there is no clear winner.

What Pig does give you, however, is more transparency: you state explicitly what kind of join you want to use, instead of relying on (sometimes obscure) optimizer hints.

In the end, the choice between Pig and Hive doesn't really matter; what matters is how you optimize your queries. If you're considering switching to Pig, I would first analyze what your processing needs really are, as you'll probably break even in terms of performance. Here is a good post if you want to compare the two.