2
votes

Couldn't find a straight answer on this anywhere. I'm joining an incoming dataset to several large tables that formerly lived in MySQL tables behind a web service. I dumped the tables to flat CSV files in Hadoop, and I'm using Pig to load the incoming dataset and table files, and to perform the joins.

It's slow going, because there are several table files to join with, and because the files themselves are so large. I'm just going for LEFT OUTER joins on a single field, nothing fancy.

So, my question is, is there any performance benefit to loading the CSV files into Hive tables and using HCatLoader within Pig instead of just loading CSV files? It doesn't seem like Hive provides any benefit besides a SQL-like interface to query tables, which doesn't matter when I'm just joining a dataset to the whole thing.

1
Why not just load your data into Hive and do all the work there? - Gordon Linoff
Are you sure that loading is your bottleneck? The outer joins seem like a more likely candidate. - gobrewers14
Gordon, I have project constraints that won't allow me to do that. @GoBrewers The joins are definitely the bottleneck, just wondering about the loading part. - economy

1 Answers

1
votes

As per Oreilly's Programming Pig(Page 165):

Hive provides SQL, it is a better tool for doing traditional data analytics. Most data analysts are already familiar with SQL, and business intelligence tools expect to speak to data sources in SQL. Pig Latin is a better choice when building a data pipeline or doing research on raw data.

So in simple words, Hive for better querying. PIG for building data pipelines. PIG over Hive for optimized querying.


PIG over Hive is useful if you are going to do many joins where partitioning maybe utilized.

If you only intend to do left joins it may not be required. Then again for further querying and analysis Hive would be better.

Read: difference b/w hive and pig and also read this.