I am running on a POC environment where there are only one name node and one data node running. Impala daemon is running on data node. Both of the nodes have 128GB memory each. I had set the mem_limit to 60GB.
I had two big tables in Impala. First table has around 635 million records while second table is around 250000 records. I inner join this 2 tables using a common parameter. The SQL statement is as the following:
select a.*, b.* from table_a a inner join table_b b on a.param=b.param order by a.t_date desc
When i use EXPLAIN, it showed Estimated Per-Host Requirements: Memory=992.03MB VCores=2. When i run this query, it took more than one hour and the result yet to be return. I am wondering why it took so long. Is this related to mem_limit settings? How can i tune such query?