Hive query having 15 tables join is expected to generate 1 Billion records, on 3 datanodes, 16GB RAM each Is this the right way to do?

Question

My name is Vitthal.

The Hortonworks HDP 2.4 Cluster on Amazon is 3 Datanodes, Masters on different Instances. 7 Instances 16GB RAM each. Total 1TB HDD Space 3 Data Nodes Hadoop version 2.7

I have pulled data from Postgres into Hadoop Distributed Environment. The Data is 15 Tables, Among them 4 tables are having 15 Million Records, rest are Masters. I've pulled them in HDFS, compressed as ORC, and SnappyCodec. Created Hive External Tables with schema.

Now I'm firing a query which joins all the 15 tables and selects the columns which I need in a final flat table. The records expected are more than 1.5 Billion.

I have optimized Hive, Yarn, MapReduce Engine viz. Parallel Execution, Vectorization, Optimized Joins, Small Table Condition, Heap Size etc.

The query is running on Cluster / Hive / Tez since 20 hours & it's reached 90% where the last reducer is running. The 90% is reached long back like since 18 hours it's stuck at 90%.

Am I doing it the right way ?

Tom Harrison Tom Harrison · Accepted Answer · 2016-04-17T17:40:45

If I understand, you have effectively copied tables in their raw form from your RDBMs into Hadoop in order to create a flattened view into one or more new tables. You're using Hive to do this. All of this sounds fine.

There are many possibilities why this is taking so long, but several come to mind.

First, YARN will allocate containers (one per CPU core, typically) that mappers and reducers will use to run the parallelized parts of the query. This should allow you to utilize all of the resources you have available.

I use Cloudera, but I assume Hortonworks has similar tools that let you see how many containers are in use, how many mappers and reducers are created by Hive, and so on. You should see that most or all of your available CPUs are in use constantly. Jobs should be finishing at some reasonable rate (perhaps every minute, or every 15 minutes). Depending on the query, Hive is often able to break it into distinct "stages" that are executed distinctly from others, then reassembled at the end.

If this is the case, everything may be fine, but your cluster may be under-resourced. But before you throw more AWS instances at the problem, consider the query itself.

First, Hive has several tools that are essential for optimizing performance, most importantly, partitioning. When you create tables, you should find some means of partitioning the resulting datasets into roughly equal subsets. A common method is to use dates, for example year+month+day (perhaps 20160417), or if you expect to have lots of historical data, maybe just year+month. This will also allow you to dramatically optimize queries that can be constrained by date. I seem to recall that Hive (or maybe it's YARN) will allocate partitions to different containers, so if you don't see all your workers working, then this would be a possible cause. Use the PARTITIONED BY clause in your CREATE TABLE statement.

The reason to choose something like date is that presumably your data is relatively evenly distributed over time (dates). We had chosen a customer_id as a partition key in an early implementation but as we grew, so did our customers. Hundreds of smaller customers would finish in a few minutes, then hundreds of mid-sized customers would finish in an hour, then a couple of our largest customers would take 10 or more hours to complete. We would see complete utilization of the cluster for that first hour, then only a couple containers in use for the last couple of customers. Not good.

This phenomenon is known as "data skew", so you want to carefully choose partitions to avoid skew. There are some options involving SKEW BY and CLUSTER BY that can help deal with getting evenly sized or smaller data files that you could consider.

Note that the raw import data should also be partitioned, as partitions act like indexes in a RDBMS, so are important for performance. In this case, choose partitions that use the keys that your larger query joins on. It is possible and common to have multiple partitions, so a date-based top-level partition, with a sub-partition on the join key could be helpful ... maybe ... depends on your data.

We have also found that it's very important to optimize the query itself. Hive has some hinting mechanisms that can direct it to run the query differently. While quite rudimentary compared to RDBMS, EXPLAIN is very helpful for understanding how Hive will break up the query and when it needs to scan a full dataset. It's hard to read the explain output, so get comfortable with the Hive documentation :-).

Lastly, if you can't make Hive do things in a sensible manner (if its optimizer still results in imbalanced stages) you can create intermediate tables with an additional Hive query that runs to create a partially transformed dataset before building the final one. This seems expensive since you're adding an additional write, and read of new tables, but in the case you describe it may be much faster overall. Also, it's sometimes useful to have intermediate tables just to test or sample data.

Writing Hive is a lot less like writing regular software -- you can get the Hive query done pretty quickly in most cases. Getting it to run fast has taken us 10 or 15 tries in a few cases. Good luck, and I hope this is helpful.

Hive query having 15 tables join is expected to generate 1 Billion records, on 3 datanodes, 16GB RAM each Is this the right way to do?

1 Answers