Ranking values based on another data set in SAS

Question

Say I have two data sets A and B that have identical variables and want to rank values in B based on values in A, not B itself (as "PROC RANK data=B" does.)

Here's a simplified example of data sets A, B and want (the desired output):

A:
obs_A  VAR1  VAR2  VAR3
1    10    100   2000
2    20    300   1000
3    30    200   4000
4    40    500   3000
5    50    400   5000

B:
obs_B  VAR1  VAR2  VAR3
1    15    150   2234
2    14    352   1555
3    36    251   1000
4    41    350   2011
5    60    553   5012

want:
obs  VAR1  VAR2  VAR3
1    2     2     3
2    2     4     2
3    4     3     1
4    5     4     3
5    6     6     6

I come up with a macro loop that involves PROC RANK and PROC APPEND like below:

%macro MyRank(A,B);
  data AB; set &A &B; run;
  %do i=1 %to 5;
    proc rank data=AB(where=(obs_A ne . OR obs_B=&i) out=tmp;
      var VAR1-3;
    run;
    proc append base=want data=tmp(where=(obs_B=&i) rename=(obs_B=obs)); run;
  %end;
%mend;

This is ok when the number of observations in B is small. But when it comes to very large number, it takes so long and thus wouldn't be a good solution.

Thanks in advance for suggestions.

Joe Joe · Accepted Answer · 2015-03-23T02:03:42

I would create formats to do this. What you're really doing is defining ranges via A that you want to apply to B. Formats are very fast - here assuming "A" is relatively small, "B" can be as big as you like and it's always going to take just as long as it takes to read and write out the B dataset once, plus a couple read/writes of A.

First, reading in the A dataset:

data ranking_vals;
input obs_A  VAR1  VAR2  VAR3;
datalines;
1    10    100   2000
2    20    300   1000
3    30    200   4000
4    40    500   3000
5    50    400   5000
;;;;
run;

Then transposing it to vertical, as this will be the easiest way to rank them (just plain old sorting, no need for proc rank).

data for_ranking;
  set ranking_vals;
  array var[3];
  do _i = 1 to dim(var);
    var_name = vname(var[_i]);
    var_value = var[_i];
    output;
  end;
run;

proc sort data=for_ranking;
  by var_name var_value;
run;

Then we create a format input dataset, and use the rank as the label. The range is (previous value -> current value), and label is the rank. I leave it to you how you want to handle ties.

data for_fmt;
  set for_ranking;
  by var_name var_value;
  retain prev_value;
  if first.var_name then do;   *initialize things for a new varname;
    rank=0;
    prev_value=.;
    hlo='l';                   *first record has 'minimum' as starting point;
  end;
  rank+1;
  fmtname=cats(var_name,'F');  
  start=prev_value;            
  end=var_value;
  label=rank;
  output;
  if last.var_name then do;       *For last record, some special stuff;
    start=var_value;
    end=.;
    hlo='h';
    label=rank+1;
    output;                       * Output that 'high' record;
    start=.;
    end=.;
    label=.;
    hlo='o';
    output;                       * And a "invalid" record, though this should never happen;
  end;
  prev_value=var_value;           * Store the value for next row.;
run;


proc format cntlin=for_fmt;
quit;

And then we test it out.

data test_b;
input obs_B  VAR1  VAR2  VAR3;
var1r=put(var1,var1f.);
var2r=put(var2,var2f.);
var3r=put(var3,var3f.);
datalines;
1    15    150   2234
2    14    352   1555
3    36    251   1000
4    41    350   2011
5    60    553   5012
;;;;
run;

Ranking values based on another data set in SAS

2 Answers