
I'm trying to measure performance on two small sets of data in order to choose an efficient execution method for a much larger pair of data sets.

This test is being done on a dataset with 32 observations and a dataset with 37 observations.

Both methods give me identical results but slightly different process times. I have a simple data step:

data check;
/* in= flags record whether each dataset contributed to the current by group */
merge d1(in=a) d2(in=b);
by ssn;
if a=0 and b=1; /* keep ssn values found in d2 but not in d1 */
run;

The data step method (1st execution) produced the following in the log:

NOTE: There were 32 observations read from the data set WORK.D1.
NOTE: There were 37 observations read from the data set WORK.D2.
NOTE: The data set WORK.CHECK has 5 observations and 1 variables.
NOTE: DATA statement used (Total process time):
      real time           0.01 seconds
      cpu time            0.01 seconds

The PROC SQL method (a NOT EXISTS query in our specific case) is below:

proc sql;
/* keep rows from d2 whose ssn has no match in d1 */
create table chck2 as
select b.* from d2 b
where not exists (select a.* from d1 a
                  where a.ssn=b.ssn)
;
quit;

PROC SQL prints the following in the log:

NOTE: PROCEDURE SQL used (Total process time):
      real time           0.04 seconds
      cpu time            0.03 seconds

These methods both yield the same results, creating my final data set of the same 5 individuals. While the data step processing seems faster (even if only by a fraction of a second), will these performance results ALWAYS hold true? Will the data step method ALWAYS win? What are the key influencing factors here? Does the order in which the tables are listed play a role, or does SAS scan both tables simultaneously?

FYI - I mentioned (1st execution) because I've noticed, from the experiment above and from general exposure, that if you run data steps repeatedly, SAS processes the subsequent runs faster than the original execution. I'm assuming this has something to do with SAS retaining something in memory from previously executed steps...?
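One way to probe that hypothesis (the caching explanation is an assumption, not something the logs above prove) is to turn on FULLSTIMER and run the same step twice: if the operating system is caching the files, the second run's real time drops toward its CPU time while the CPU time stays roughly flat.

options fullstimer; /* adds memory and host-level detail to the usual timing notes */

data check;
merge d1(in=a) d2(in=b);
by ssn;
if a=0 and b=1;
run;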


1 Answer


You'll never get meaningful performance evaluations from small datasets. Overhead will inevitably swamp any actual performance difference. PROC SQL has a bit of overhead involved in invoking the procedure (a few hundredths of a second), which here is more than the total execution time. Run your test with datasets large enough that the steps take minutes to run - that's usually the right balance between tests taking too long and legitimate differences being drowned out by overhead and randomness.
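A rough sketch for building test tables big enough to compare (the row counts and the numeric ssn key are arbitrary assumptions, not taken from the question; scale them until the steps take at least tens of seconds):

/* two large tables with partially overlapping keys */
data d1_big;
  do ssn = 1 to 5000000;
    output;
  end;
run;

data d2_big;
  do ssn = 1 to 6000000;
    output;
  end;
run;

Then run the data step merge and the NOT EXISTS query against d1_big/d2_big and compare the log timings.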

As far as what would be faster: if the datasets are sorted, and SAS knows they're sorted, the odds are very good that both approaches will run in the same order of magnitude of time. The data step merge is quite fast, and so is the SQL join.
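On "SAS knows it's sorted": that's the sort flag on the dataset. PROC CONTENTS shows it and PROC SORT sets it; a quick illustrative check, using the question's dataset names:

/* look for "Sorted: YES" and the sort key in the output */
proc contents data=d2;
run;

/* one explicit sort up front lets both the data step merge and PROC SQL use the order */
proc sort data=d1; by ssn; run;
proc sort data=d2; by ssn; run;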

If the data aren't sorted, SQL might (and probably would) choose to turn the WHERE NOT EXISTS into a hash join, which would be much faster than sorting a large dataset - though that requires the smaller table to fit into memory. Sorting and then merging in the data step might take about the same time as SQL, or it might be slower - or even faster, though I suspect not by much if it requires sorting first. There are faster data step solutions than sort/merge when that matters (a hash object or a format); see the sketch below.
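A minimal sketch of the hash-object route, using the question's dataset names and assuming d1's ssn values fit comfortably in memory:

data check_hash;
  /* load d1's keys into an in-memory hash table - no sorting required */
  if _n_ = 1 then do;
    declare hash h(dataset: "d1");
    h.defineKey("ssn");
    h.defineDone();
  end;
  set d2;
  /* find() returns 0 on a match, so keep only ssn values not present in d1 */
  rc = h.find();
  if rc ne 0;
  drop rc;
run;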

As far as the order of the tables in the PROC SQL statement goes: odds are it won't matter, as long as SQL can figure out what you're doing and optimize it. It might matter, though, because SQL may not always see the optimal path, so one order (usually the larger dataset as the main table and the smaller one in the subquery) may help SQL find the right approach more easily than the other.
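If you want to see which strategy PROC SQL actually picks, the _METHOD option prints the chosen execution plan to the log (a code such as sqxjhsh indicates a hash join). An example on the question's tables:

proc sql _method;
/* same NOT EXISTS query; the plan codes appear in the log */
create table chck2 as
select b.* from d2 b
where not exists (select a.* from d1 a
                  where a.ssn=b.ssn)
;
quit;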

And the reason SAS is faster on a second or later run is that your OS (or possibly your file system) is caching the read, so it doesn't have to re-read the input datasets from disk.