
I am running a simple query in Redshift where the source table has 26 million records and the other table has 120k records. One issue is that I don't have a direct relationship between the two tables, so I am joining on a foreign key that is common to both. When I add a LIMIT the query completes within seconds, but without a LIMIT it runs forever.

 `select a.person_name, b.city
from persons a
left outer join address b on a.zip_code = b.zip_code`

`QUERY PLAN
XN Hash Left Join DS_BCAST_INNER  (cost=1559.11..39246081357.77 rows=224197624453 width=15)
           Hash Cond: ("outer".zip_code = "inner".zip_code)
  ->  XN Seq Scan on persons a  (cost=0.00..269134.10 rows=26913410 width=16)
  ->  XN Hash  (cost=1247.29..1247.29 rows=124729 width=15)
        ->  XN Seq Scan on address b  (cost=0.00..1247.29 rows=124729 width=15)`


Any help would be appreciated.

1 Answer


You probably have a bunch of zip codes that occur multiple times in both tables, so the join multiplies every matching pair of rows. You can calculate the size of the intermediate result of the JOIN using:

select sum(p.cnt * a.cnt)
from (select zip_code, count(*) as cnt
      from persons
      group by zip_code
     ) p join
     (select zip_code, count(*) as cnt
      from address
      group by zip_code
     ) a
     on p.zip_code = a.zip_code;
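
To see which zip codes are responsible, a variation of the same query (a sketch using the same two subqueries as above) ranks the worst offenders:

select p.zip_code,
       p.cnt as persons_rows,
       a.cnt as address_rows,
       p.cnt * a.cnt as joined_rows
from (select zip_code, count(*) as cnt
      from persons
      group by zip_code
     ) p join
     (select zip_code, count(*) as cnt
      from address
      group by zip_code
     ) a
     on p.zip_code = a.zip_code
order by joined_rows desc
limit 10;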

You may find that this is actually generating billions and billions of rows -- if not more. In fact, your query plan already estimates this: rows=224197624453 is roughly 224 billion rows. For example, if a single zip code appears 1,000 times in persons and 50 times in address, that one zip code alone contributes 50,000 rows to the result. That is why the query is taking so long.

To fix the problem, you need join conditions that uniquely identify a row on at least one side of the join -- ideally a proper key rather than a shared zip code.
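
If any one matching city per zip code would be acceptable for your purposes, one option (a sketch -- Redshift supports ROW_NUMBER() window functions) is to reduce address to a single row per zip_code before joining, so the address side can no longer multiply rows:

select a.person_name, b.city
from persons a left outer join
     (select zip_code, city
      from (select zip_code, city,
                   row_number() over (partition by zip_code order by city) as seqnum
            from address
           ) x
      where seqnum = 1
     ) b
     on a.zip_code = b.zip_code;

Whether this is correct for your data depends on why address has multiple rows per zip code; if the duplicates carry meaning, you need a real key instead.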