1
votes

I have this proc sql query in my current code. Unfortunately, I'm dealing with over 10 million records, so it takes hours to run. I've been trying to convert it to a data step, thinking it would run much quicker. However, I can't seem to get the same data results. If anyone can help me with the data step I'd greatly appreciate it. Or if you have suggestions on how I can make the proc sql run more efficiently.

Here is my proc sql query:

proc sql;
  create table test as
  select *
  from table1 a
  where exists (select 1
                from table2 b
                where b.acct_id = a.acct_id);
quit;

This is the data step I tried converting it to:

proc sort data=table1; by acct_id; run;
proc sort data=table2; by acct_id; run;

data test;
  merge table1   (in=a)
        table2   (in=b);
  by acct_id;
  if a and b;
run;
2
Does select * from table1 where acct_id in (select acct_id from table2) perform any faster? (That query might make the intent more clear to the query planner - hopefully saving the need for indexing/sorting/joining)Jon Clements
At least in SAS 9.3 the proc sql is quite bad at optimizing IN/EXIST statements.Jetzler
I just did a small scale test with your suggestion and it seemed substantially faster (11 minutes vs 32 seconds). I'm going to try a full run now and see how it goes.UncleCross

2 Answers

0
votes

Try an inner join in SQL. You will have to list out each of the variables that need to match.

create table test as
select *
    from
        table1 as a
      inner join
        table2 as b
      on a.acct_id = b.acct_id
      and a.var1 = b.var2 
        ....
      ;

This should avoid that inner select which I suspect your time was being spent.

If that is too slow, then consider putting an index on acct_id in both tables. That should speed up the join.

0
votes

As to why your current data step doesn't work is most likely because you have duplicate keys on table2 (which will distort the observations where there's a 1-N or N-N merge). If you amend the sort to only keep the keys and remove duplicates then the merge should give the expected results.

proc sort data=table1; by acct_id; run;
proc sort data=table2 (keep=acct_id) out=wanted_accounts nodupkey; by acct_id; run;

data test;
merge table1 (in=a)
      wanted_accounts (in=b);
by acct_id;
if a and b;
run;