I have two tables in hive. One has around 2 millions of records and other has 14 miliions of records. I am joining these two tables. Also I am applying UDF in WHERE clause. It is taking too much time to perform JOIN operation.
I have tried to run the query for many times but it run for around 2 hrs and still my reducer remains at 70% and after that I am getting exception "java.io.IOException: No space left on device" and job gets killed.
I have tried to set the parameters as below:
set mapreduce.task.io.sort.mb=256;
set mapreduce.task.io.sort.factor=100;
set mapreduce.map.output.compress=true;
set mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.GzipCodec;
set mapred.child.java.opts=-Xmx1024m;
My Query -
insert overwrite table output select col1, col2, name1, name2, col3, col4,
t.zip, t.state from table1 m join table2 t ON (t.state=m.state and t.zip=m.zip)
where matchStrings(concat(name1,'|',name2))>=0.9;
The above query takes 8 mappers and 2 reducers.
Can someone please suggest what do I suppose to do to improve performance.
WHEREclause anywhere is this query. - gobrewers14