1
votes

I am running on a POC environment where there are only one name node and one data node running. Impala daemon is running on data node. Both of the nodes have 128GB memory each. I had set the mem_limit to 60GB.

I had two big tables in Impala. First table has around 635 million records while second table is around 250000 records. I inner join this 2 tables using a common parameter. The SQL statement is as the following:

select a.*, b.* from table_a a inner join table_b b on a.param=b.param order by a.t_date desc

When i use EXPLAIN, it showed Estimated Per-Host Requirements: Memory=992.03MB VCores=2. When i run this query, it took more than one hour and the result yet to be return. I am wondering why it took so long. Is this related to mem_limit settings? How can i tune such query?

1

1 Answers

1
votes

Try tuning as Impala performance

Some ideal

  • Try big_table join small_table
  • Partition on param column
  • If have many query execute in the same time, you should enable Admission controll (2) and Dynamic Resource Pools (3)
  • Try execute summary after execute your query in impala-shell to see what step take long time.
  • And plz post all result of EXPLAIN statement

P/S: Sorry because im not enough reputation to post more than 2 link