Convert Proc Sql Exists Query to Data Step

Question

I have this proc sql query in my current code. Unfortunately, I'm dealing with over 10 million records, so it takes hours to run. I've been trying to convert it to a data step, thinking it would run much quicker. However, I can't seem to get the same data results. If anyone can help me with the data step I'd greatly appreciate it. Or if you have suggestions on how I can make the proc sql run more efficiently.

Here is my proc sql query:

proc sql;
  create table test as
  select *
  from table1 a
  where exists (select 1
                from table2 b
                where b.acct_id = a.acct_id);
quit;

This is the data step I tried converting it to:

proc sort data=table1; by acct_id; run;
proc sort data=table2; by acct_id; run;

data test;
  merge table1   (in=a)
        table2   (in=b);
  by acct_id;
  if a and b;
run;

Does select * from table1 where acct_id in (select acct_id from table2) perform any faster? (That query might make the intent more clear to the query planner - hopefully saving the need for indexing/sorting/joining) — Jon Clements
At least in SAS 9.3 the proc sql is quite bad at optimizing IN/EXIST statements. — Jetzler
I just did a small scale test with your suggestion and it seemed substantially faster (11 minutes vs 32 seconds). I'm going to try a full run now and see how it goes. — UncleCross

DomPazz DomPazz · Accepted Answer · 2016-04-11T12:34:50

Try an inner join in SQL. You will have to list out each of the variables that need to match.

create table test as
select *
    from
        table1 as a
      inner join
        table2 as b
      on a.acct_id = b.acct_id
      and a.var1 = b.var2 
        ....
      ;

This should avoid that inner select which I suspect your time was being spent.

If that is too slow, then consider putting an index on acct_id in both tables. That should speed up the join.

Convert Proc Sql Exists Query to Data Step

2 Answers