Joins on two large tables using UDF in Hive - performance is too slow

Question

I have two tables in hive. One has around 2 millions of records and other has 14 miliions of records. I am joining these two tables. Also I am applying UDF in WHERE clause. It is taking too much time to perform JOIN operation.

I have tried to run the query for many times but it run for around 2 hrs and still my reducer remains at 70% and after that I am getting exception "java.io.IOException: No space left on device" and job gets killed.

I have tried to set the parameters as below:

set mapreduce.task.io.sort.mb=256;
set mapreduce.task.io.sort.factor=100;
set mapreduce.map.output.compress=true;
set mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.GzipCodec;
set mapred.child.java.opts=-Xmx1024m;

My Query -

insert overwrite table output select col1, col2, name1, name2, col3, col4, 
t.zip, t.state from table1 m join table2 t ON (t.state=m.state and t.zip=m.zip) 
where matchStrings(concat(name1,'|',name2))>=0.9;

The above query takes 8 mappers and 2 reducers.

Can someone please suggest what do I suppose to do to improve performance.

juanpampliega juanpampliega · Accepted Answer · 2014-09-05T00:39:06

That exception probably indicates that you do not have enough space in the cluster for the temporary files created by the query you are running. You should try adding more disk space to the cluster or reducing the amount of rows that are joined by using a subquery to first filter the rows from each table.

Joins on two large tables using UDF in Hive - performance is too slow

1 Answers