Replicating a SQL function in a SAS DATA step


Hi, another quick question.

In PROC SQL we have the ON clause, which is used for a conditional join. Is there something similar for the SAS DATA step?

For example:

proc sql;
....
data1 left join data2
on first<value<last;
quit;

Can we replicate this in a SAS DATA step?

data work.combined;
   set data1(in=a) data2(in=b);

   if a then output;
run;

sas proc-sql datastep

Are the keys on table data1 unique? – Jon Clements

No, the left join does not have a 1-1 match – user3135774

@arcoder sorry - I rephrased the comment... – Jon Clements

What tables are first/value/last coming from? – Jon Clements

data1 contains first and last; data2 contains value – user3135774

4 Answers

votes

You can also reproduce an SQL join in a single DATA step using hash objects. It can be really fast, but it depends on how much RAM your machine has, since this method loads one table into memory: the more RAM, the larger the dataset you can load into the hash. This method is particularly effective for look-ups in a relatively small reference table.

data have1;
    input first last;
datalines;
1 3
4 7
6 9
;
run;

data have2;
    input value;
datalines;
2
5
6
7
;
run;

data want;
    if _N_=1 then do;
        if 0 then set have2;
        declare hash h(dataset:'have2');
        h.defineKey('value');
        h.defineData('value');
        h.defineDone();
        declare hiter hi('h');
    end;
    set have1;

    rc=hi.first();
    do while(rc=0);
        if first<value<last then output;
        rc=hi.next();
    end;
    drop rc;
run;

The result:

value  first  last
2       1       3
5       4       7
6       4       7
7       6       9
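
Note that the step above only writes have1 rows that find at least one value, so it behaves like an inner join; with this sample data every range has a match, so the difference does not show. A minimal sketch of a left-join variant follows, assuming the same have1 and have2 datasets. The names want_left and found are made up for illustration.

```sas
data want_left;
    if _N_=1 then do;
        if 0 then set have2;              /* define VALUE in the PDV */
        declare hash h(dataset:'have2');
        h.defineKey('value');
        h.defineData('value');
        h.defineDone();
        declare hiter hi('h');
    end;
    set have1;

    found=0;                              /* hypothetical flag: did this range match anything? */
    rc=hi.first();
    do while(rc=0);
        if first<value<last then do;
            found=1;
            output;
        end;
        rc=hi.next();
    end;

    /* keep unmatched have1 rows, with VALUE missing, as a left join would */
    if not found then do;
        call missing(value);
        output;
    end;
    drop rc found;
run;
```

Because the hash is keyed on value without the multidata option, duplicate values in have2 would be stored only once.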

votes

Yes, there is a simple (but subtle) way, in just 7 lines of code.

What you intend to achieve is intrinsically a conditional Cartesian join, which can be done with a do-looped SET statement. The following code uses the test dataset from Dmitry and a modified version of the code in the appendix of SUGI Paper 249-30.

data data1;
    input first last;
datalines;
1 3
4 7
6 9
;
run;

data data2;
    input value;
datalines;
2
5
6
7
;
run;

/***** by data step looped SET *****/
DATA CART_data(drop=i); /*drop the loop counter from the output*/
    SET data1; 
    DO i=1 TO NN; /*NN can be referenced before set*/
        SET data2 point=i nobs=NN; /*point=i - random access*/
        if first<value<last then OUTPUT; /*conditional output*/
    END; 
RUN;

/***** by SQL *****/
proc sql;
    create table cart_SQL as 
    select * from data1 
    left join data2
        on first<value<last;
quit;

One can easily see that the results coincide.

Also note this, from the SAS 9.2 documentation: "At compilation time, SAS reads the descriptor portion of each data set and assigns the value of the NOBS= variable automatically. Thus, you CAN refer to the NOBS= variable BEFORE the SET statement. The variable is available in the DATA step but is not added to any output data set."
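
As a small illustration of that point, here is a sketch that prints the row count of data2 without reading a single observation. It assumes the data2 dataset created above; the variable name nn is arbitrary.

```sas
data _null_;
    /* NN already holds the row count here, because NOBS= is set at compile time */
    put 'Rows in data2: ' nn;
    stop;                      /* end the step before the SET statement executes */
    set data2 nobs=nn;
run;
```

The SET statement never executes, but it still has to be present so that the compiler reads the descriptor of data2.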

votes

There isn't a direct way to do this with a MERGE. This is one example where the SQL method is clearly superior to any SAS data step methods, as anything you do will take much more code and possibly more time.

However, depending on the data, it's possible a few approaches may make sense. In particular, the format merge.

If data1 is fairly small (even, say, millions of records), you can make a format out of it. Like so:

data fmt_set;
    set data1;
    format label $8.;
    start=first;       *map the range bounds onto the CNTLIN names;
    end=last;
    label='MATCH';
    fmtname='DATA1F';
    output;
    if _n_=1 then do;  *write one hlo='o' (other) row to catch unmatched values;
        start=.;       *both unnecessary but nice for clarity;
        end=.;
        label='NOMATCH';
        hlo='o';
        output;
    end;
run;

proc format cntlin=fmt_set; *import the dataset;
quit;

data want;
set data2;
if put(value,DATA1F.)="MATCH";
run;

This is very fast to run unless data1 is extremely large (hundreds of millions of rows, on my system). It is faster than a data step merge if you include sort time, since this doesn't require a sort. One major limitation is that it gives you only one row per data2 row; if that is what you want, this will work. If you want a data2 row repeated once for every data1 range it falls in, you can't do it this way.

If data1 may have overlapping rows (i.e., two rows whose start/end ranges overlap each other), you will also need to address this, since ranges in a format normally aren't allowed to overlap. You can set hlo="m" for every row, and "om" for the non-match row, or you can resolve the overlaps.
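
The sample data1 used in the other answers does overlap (4-7 and 6-9), so here is a hedged sketch of the multilabel variant. It also sets the CNTLIN variables SEXCL and EEXCL to exclude both endpoints, because format ranges are inclusive by default while the original condition is first<value<last. The names fmt_set_ml, DATA1M and want_ml are made up for this example, and the code assumes data1 and data2 exist as above.

```sas
data fmt_set_ml;
    set data1;
    length fmtname $8 label $8 hlo $2 sexcl eexcl $1;
    fmtname='DATA1M';
    start=first;
    end=last;
    sexcl='Y';             /* exclude the start value: first < value */
    eexcl='Y';             /* exclude the end value:   value < last  */
    label='MATCH';
    hlo='M';               /* multilabel, so overlapping ranges are accepted */
    output;
    if _n_=1 then do;      /* one "other" row for values in no range */
        start=.;
        end=.;
        sexcl='N';
        eexcl='N';
        label='NOMATCH';
        hlo='OM';
        output;
    end;
run;

proc format cntlin=fmt_set_ml;
run;

data want_ml;
    set data2;
    /* PUT uses only the primary label, which is enough for a yes/no match test */
    if put(value,DATA1M.)='MATCH';
run;
```

As with the single-label version, each data2 row is kept at most once, no matter how many ranges it falls in.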

I'd still do the sql join, however, since it's much shorter to code and much easier to read, unless you have performance issues, or it doesn't work the way you want it to.

votes

Here's another solution, using a temporary array to hold the lookup dataset. Performance is probably similar to Dmitry's hash-based solution, but this should also work for people still using versions of SAS prior to 9.1, which is when hash objects were first introduced.

I've reused Dmitry's sample datasets:

data have1;
    input first last;
datalines;
1 3
4 7
6 9
;
run;

data have2;
    input value;
datalines;
2
5
6
7
;
run;

/*We need a macro var with the number of obs in the lookup dataset*/
/*This is so we can specify the dimension for the array to hold it*/
data _null_;
     if 0 then set have2 nobs = nobs;
     call symput('have2_nobs',put(nobs,8.));
     stop;
run;

data want_temparray;
    array v{&have2_nobs} _temporary_;
    do _n_ = 1 to &have2_nobs;
        set have2 (rename=(value=value_array));
        v{_n_}=value_array;
    end;
    do _n_ = 1 by 1 until (eof_have1);
        set have1 end = eof_have1;
        value=.;
        do i=1 to &have2_nobs;
            if first < v{i} < last then do;
                value=v{i};
                output;
            end;
        end;
        if missing(value) then output;
    end;
    drop i value_array;
run;

Output:

value  first  last
2       1       3
5       4       7
6       4       7
7       6       9

This matches the output from the equivalent SQL:

proc sql;
create table want_sql as 
    select * from
    have1 left join have2
    on first<value<last
    ;
quit;