Couldn't find a straight answer on this anywhere. I'm joining an incoming dataset to several large tables that formerly lived in MySQL tables behind a web service. I dumped the tables to flat CSV files in Hadoop, and I'm using Pig to load the incoming dataset and table files, and to perform the joins.
It's slow going, because there are several table files to join against, and because the files themselves are so large. I'm just doing LEFT OUTER joins on a single field, nothing fancy.
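For reference, here's roughly the shape of what I'm doing (file names, field names, and schemas are made up for illustration):

```pig
-- incoming dataset plus a couple of the large dumped tables
incoming = LOAD 'incoming.csv' USING PigStorage(',') AS (id:chararray, val:chararray);
big_a    = LOAD 'table_a.csv'  USING PigStorage(',') AS (id:chararray, a:chararray);
big_b    = LOAD 'table_b.csv'  USING PigStorage(',') AS (id:chararray, b:chararray);

-- plain LEFT OUTER joins, one field, chained across the table files
joined_a = JOIN incoming BY id LEFT OUTER, big_a BY id;
joined_b = JOIN joined_a BY incoming::id LEFT OUTER, big_b BY id;

STORE joined_b INTO 'output' USING PigStorage(',');
```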
So, my question is: is there any performance benefit to loading the CSV files into Hive tables and reading them with HCatLoader in Pig, rather than just loading the CSV files directly? It doesn't seem like Hive provides any benefit besides a SQL-like interface for querying tables, which doesn't matter when I'm joining against the whole table anyway.
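In other words, is there any reason to prefer something like the following (assuming I'd loaded the CSVs into Hive tables; the table name here is made up, and the HCatLoader package name varies by Hive/HCatalog version):

```pig
-- same join, but reading the table through HCatalog instead of PigStorage;
-- the schema would come from the Hive metastore rather than an AS clause
big_a = LOAD 'default.table_a' USING org.apache.hive.hcatalog.pig.HCatLoader();
```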