2
votes

Say I have two data sets A and B that have identical variables and want to rank values in B based on values in A, not B itself (as "PROC RANK data=B" does.)

Here's a simplified example of data sets A, B and want (the desired output):

A:
obs_A  VAR1  VAR2  VAR3
1    10    100   2000
2    20    300   1000
3    30    200   4000
4    40    500   3000
5    50    400   5000

B:
obs_B  VAR1  VAR2  VAR3
1    15    150   2234
2    14    352   1555
3    36    251   1000
4    41    350   2011
5    60    553   5012

want:
obs  VAR1  VAR2  VAR3
1    2     2     3
2    2     4     2
3    4     3     1
4    5     4     3
5    6     6     6

I come up with a macro loop that involves PROC RANK and PROC APPEND like below:

%macro MyRank(A,B);
  data AB; set &A &B; run;
  %do i=1 %to 5;
    proc rank data=AB(where=(obs_A ne . OR obs_B=&i) out=tmp;
      var VAR1-3;
    run;
    proc append base=want data=tmp(where=(obs_B=&i) rename=(obs_B=obs)); run;
  %end;
%mend;

This is ok when the number of observations in B is small. But when it comes to very large number, it takes so long and thus wouldn't be a good solution.

Thanks in advance for suggestions.

2

2 Answers

2
votes

I would create formats to do this. What you're really doing is defining ranges via A that you want to apply to B. Formats are very fast - here assuming "A" is relatively small, "B" can be as big as you like and it's always going to take just as long as it takes to read and write out the B dataset once, plus a couple read/writes of A.

First, reading in the A dataset:

data ranking_vals;
input obs_A  VAR1  VAR2  VAR3;
datalines;
1    10    100   2000
2    20    300   1000
3    30    200   4000
4    40    500   3000
5    50    400   5000
;;;;
run;

Then transposing it to vertical, as this will be the easiest way to rank them (just plain old sorting, no need for proc rank).

data for_ranking;
  set ranking_vals;
  array var[3];
  do _i = 1 to dim(var);
    var_name = vname(var[_i]);
    var_value = var[_i];
    output;
  end;
run;

proc sort data=for_ranking;
  by var_name var_value;
run;

Then we create a format input dataset, and use the rank as the label. The range is (previous value -> current value), and label is the rank. I leave it to you how you want to handle ties.

data for_fmt;
  set for_ranking;
  by var_name var_value;
  retain prev_value;
  if first.var_name then do;   *initialize things for a new varname;
    rank=0;
    prev_value=.;
    hlo='l';                   *first record has 'minimum' as starting point;
  end;
  rank+1;
  fmtname=cats(var_name,'F');  
  start=prev_value;            
  end=var_value;
  label=rank;
  output;
  if last.var_name then do;       *For last record, some special stuff;
    start=var_value;
    end=.;
    hlo='h';
    label=rank+1;
    output;                       * Output that 'high' record;
    start=.;
    end=.;
    label=.;
    hlo='o';
    output;                       * And a "invalid" record, though this should never happen;
  end;
  prev_value=var_value;           * Store the value for next row.;
run;


proc format cntlin=for_fmt;
quit;

And then we test it out.

data test_b;
input obs_B  VAR1  VAR2  VAR3;
var1r=put(var1,var1f.);
var2r=put(var2,var2f.);
var3r=put(var3,var3f.);
datalines;
1    15    150   2234
2    14    352   1555
3    36    251   1000
4    41    350   2011
5    60    553   5012
;;;;
run;
0
votes

One way that you can rank by a variable from a separate dataset is by using proc sql's correlated subqueries. Essentially you counts the number of lower values in the lookup dataset for each value in the data to be ranked.

proc sql;
    create table want as
    select 
        B.obs_B, 
        (
            select count(distinct A.Var1) + 1
            from A
            where A.var1 <= B.var1.
        ) as var1
    from B;
quit;

Which can be wrapped in a macro. Below, a macro loop is used to write each of the subqueries. It looks through the list of variable and parametrises the subquery as required.

%macro rankBy(
        inScore /*Dataset containing data to be ranked*/, 
        inLookup /*Dataset containing data against which to rank*/, 
        varID /*Variable by which to identify an observation*/, 
        varsRank /*Space separated list of variable names to be ranked*/, 
        outData /*Output dataset name*/);
    /* Rank variables in one dataset by identically named variables in another */
    proc sql;
        create table &outData. as
        select 
            scr.&varID.
            /* Loop through each variable to be ranked */
            %do i = 1 %to %sysfunc(countw(&varsRank., %str( )));
                /* Store the variable name in a macro variable */
                %let var = %scan(&varsRank., &i., %str( ));
                /* Rank: count all the rows with lower value in lookup */
                , (
                    select count(distinct lkp&i..&var.) + 1
                    from &inLookup. as lkp&i.
                    where lkp&i..&var. <= scr.&var.
                ) as &var.
            %end;
        from &inScore. as scr;
    quit;
%mend rankBy;

%rankBy(
    inScore = B,
    inLookup = A,
    varID = obs_B,
    varsRank = VAR1 VAR2 VAR3,
    outData = want);

Regarding speed, this will be slow if your A is large, but should be okay for large B and small A.

In rough testing on a slow PC I saw:

A: 1e1    B: 1e6    time: ~1s
A: 1e2    B: 1e6    time: ~2s
A: 1e3    B: 1e6    time: ~5s
A: 1e1    B: 1e7    time: ~10s
A: 1e2    B: 1e7    time: ~12s
A: 1e4    B: 1e6    time: ~30s

Edit: As Joe points out below the length of time the query takes depends not just on the number of observations in the dataset, but how many unique values exist within the data. Apparently SAS performs optimisations to reduce the comparisons to only the distinct values in B, thereby reducing the number of times the elements in A need to be counted. This means that if the dataset B contains a large number of unique values (in the ranking variables) the process will take significantly longer then the times shown. This is more likely to happen if your data is not integers as Joe demonstrates.


Edit: Runtime test rig:

data A;
    input obs_A  VAR1  VAR2  VAR3;
datalines;
1    10    100   2000
2    20    300   1000
3    30    200   4000
4    40    500   3000
5    50    400   5000
;
run;
data B;
    do obs_B = 1 to 1e7;
        VAR1 = ceil(rand("uniform")* 60);
        VAR2 = ceil(rand("uniform")* 500);
        VAR3 = ceil(rand("uniform")* 6000);
        output;
    end;
run;
%let start = %sysfunc(time());
%rankBy(
    inScore = B,
    inLookup = A,
    varID = obs_B,
    varsRank = VAR1 VAR2 VAR3,
    outData = want);
%let time = %sysfunc(putn(%sysevalf(%sysfunc(time()) - &start.), time12.2));
%put &time.;

Output:

0:00:12.41