I'm trying to measure performance on two small data sets in order to determine an efficient execution method for a much larger pair of data sets.
*This test is being done on a data set with 32 observations and a data set with 37 observations.
Both methods give me identical results with slightly different process times. I have a simple DATA step:
data check;
   merge d1(in=a) d2(in=b);
   by ssn;
   if a=0 and b=1;   /* keep ssn values present in d2 but not in d1 */
run;
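(For completeness: MERGE with a BY statement requires both inputs to be sorted, or indexed, on the BY variable. A minimal prep step, assuming d1 and d2 are not already in ssn order:)

proc sort data=d1; by ssn; run;
proc sort data=d2; by ssn; run;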
The DATA step method (1st execution) produced the following log:
NOTE: There were 32 observations read from the data set WORK.D1.
NOTE: There were 37 observations read from the data set WORK.D2.
NOTE: The data set WORK.CHECK has 5 observations and 1 variables.
NOTE: DATA statement used (Total process time):
real time 0.01 seconds
cpu time 0.01 seconds
The PROC SQL method (a NOT EXISTS query in our specific case) is below:
proc sql;
   create table chck2 as
   select b.*
   from d2 b
   where not exists (select a.*
                     from d1 a
                     where a.ssn = b.ssn);
quit;
PROC SQL prints the following in the log:
NOTE: PROCEDURE SQL used (Total process time):
real time 0.04 seconds
cpu time 0.03 seconds
These methods both yield the same results, creating my final data set of the same 5 individuals. While the DATA step processing seems faster (even if only by a fraction of a second), will these performance results ALWAYS hold true? Will the DATA step method ALWAYS win? What are the key influencing factors here? Does listing the tables in a certain order play a role, or does SAS scan both tables simultaneously?
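One way to check, rather than extrapolating from a 37-row test, is to rerun both methods on larger synthetic data and let PROC SQL report its plan. A rough sketch, assuming sequential numeric ssn values are an acceptable stand-in for the real keys and 1,000,000 rows roughly approximates the larger pair (the _method option asks PROC SQL to print its chosen access plan to the log):

data big1;                       /* 1,000,000 rows, already in ssn order */
   do ssn = 1 to 1000000; output; end;
run;

data big2;                       /* overlaps big1 except its last 5 values */
   do ssn = 6 to 1000005; output; end;
run;

data check_big;                  /* DATA step method at scale */
   merge big1(in=a) big2(in=b);
   by ssn;
   if a=0 and b=1;
run;

proc sql _method;                /* _method logs the access plan chosen */
   create table chck2_big as
   select b.*
   from big2 b
   where not exists (select a.*
                     from big1 a
                     where a.ssn = b.ssn);
quit;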
FYI: I mentioned (1st execution) because I noticed, from the experiment above and from general exposure, that if you run DATA steps repeatedly, SAS processes the subsequent runs faster than the original execution. I'm assuming this has something to do with SAS retaining something in memory from previously executed steps...?
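For what it's worth, the speedup on repeated runs is usually attributed to the operating system's file cache rather than SAS itself remembering the step. One way to dig into that, assuming the default log notes aren't detailed enough, is the FULLSTIMER system option:

options fullstimer;   /* adds memory and I/O statistics to each step's log notes */

With FULLSTIMER on, a large gap between real time and CPU time on the first run that shrinks on the second run suggests the data was initially read from disk and is now being served from cache.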