0
votes

Re-uploading since there was some problems with my last post, and I did not know that we were supposed to post sample data. I'm fairly new to SAS, and I have a problem that I know how to solve in Excel but not SAS. however, the dataset is too large to reasonably use in Excel.

I have four variables: id, year_start, groupname, test_score.

Sample data:

id     year_start     group_name     test_score
1       19931231          Red            90
1       19941230          Red            89
1       19951231          Red            91
1       19961231          Red            92
2       19930630          Red            85
2       19940629          Red            87
2       19950630          Red            95
3       19950931          Blue           90
3       19960931          Blue           90
4       19930331          Red            95
4       19940331          Red            97
4       19950330          Red            98
4       19960331          Red            95
5       19931231          Red            96
5       19941231          Red            97

My goal is to achieve a ranked list (fractional) by test_score for each year. I hoped that I would be able to achieve this using PROC RANK FRACTION. This function would calculate order by a test_score (highest is 1, 2nd highest is 2 and so on) and then divide by the total number of observations to provide a fractional rank. Unfortunately, year_start differs widely from row to row. For each id/year combo, I want to perform a one-year look-back from year-start, and rank that observation compared to all other id's that have a year_start in that one year range. I'm not interested in comparing by calendar year, and the rank of each id should be relative to its own year_start. Adding another level of complication, I would like this rank to be performed by groupname.

PROC SQL is totally fine if someone has a SQL solution.

Using the above data, the ranks would be like this:

id     year_start     group_name     test_score     rank
1       19931231          Red            90         0.75
1       19941230          Red            89          0.8
1       19951231          Red            91           1
1       19961231          Red            92           1
2       19930630          Red            85           1
2       19940629          Red            87          0.8
2       19950630          Red            95         0.75
3       19950931          Blue           90           1
3       19960931          Blue           90           1
4       19930331          Red            95           1
4       19940331          Red            97          0.2
4       19950330          Red            98          0.2
4       19960331          Red            95         0.333
5       19931231          Red            96         0.25
5       19941231          Red            97         0.667

In order to calculate the rank for row 1,

  • we first exclude blue observations.
  • Then we count the number of observations that fall within a year before that year_start, 19931231 (so we have 4 observations).
  • We count how many of these observations have a higher test_score, and then add 1 to find the order of the current observation (So it is the 3rd highest).
  • Then, we divide the order by the total number to get the rank (3/4= 0.75).

In Excel, the formula for this variable would look something like this. Assume formula is for row 1 and there are 100 rows. id=A, year_start=B, groupname=C, and test_score=D:

      =(1+countifs(D1:D100,">"&D1, 
                B1:B100,"<="&B1,
                B1:B100,">"&B1-365.25,
                C1:C100, C1))/
       countifs(B1:B100,"<="&B1,
                B1:B100,">"&B1-365.25,
                C1:C100, C1) 

Thanks so much for the help!

ahammond428

2
Can you explain the logic for the rank function? Maybe do some example calculations. Please read How-to-Ask And here is a great place to START to learn how improve your question quality and get better answers.Juan Carlos Oropeza
Your example is wrong. In the first row, the rank would be 3 of 4, since it is the 3rd highest red score within a year. Also there's no date of Sept. 31st.Talmage
Oh wow, you're right, my apologies. In the actual data, lower is better, and I forgot to update the excel formula and outputs to match the sample I provided. Updated..ahammond428

2 Answers

1
votes

Your example isn't correct if I'm reading it correctly, so it's hard to know exactly what you're trying to do. But try the following and see if it works. You may need to tweak inequalities to be open or closed depending on whether you want to include one year to the date. Note that your year_start column needs to be imported in a SAS date format for this to work. Otherwise you can change it over with input(year_start, yymmdd8.).

proc sql;
select distinct
    a.id,
    a.year_start,
    a.group_name,
    a.test_score,
    1+sum(case when b.test_score > a.test_score then 1 else 0 end) as rank_num,
    count(b.id) as rank_denom,
    calculated rank_num / calculated rank_denom as rank
from testdata a left join testdata b
    on a.group_name = b.group_name
    and intnx('year',a.year_start,-1,'s') le b.year_start le a.year_start
group by a.id, a.year_start, a.group_name, a.test_score
order by id, year_start;
quit;

Note that I changed dates of 9/31 to 9/30 (since there is no 9/31), but left 3/30, 6/29, and 12/30 alone since perhaps that was intended, though the other dates seem to be quarter-end.

0
votes

Consider correlated count subqueries in SQL:

DATA

data ranktable;   
    infile datalines missover;  
    input id year_start group_name $ test_score; 
    datalines; 
1       19931231          Red            90
1       19941230          Red            89
1       19951231          Red            91
1       19961231          Red            92
2       19930630          Red            85
2       19940629          Red            87
2       19950630          Red            95
3       19950930          Blue           90
3       19960930          Blue           90
4       19930331          Red            95
4       19940331          Red            97
4       19950330          Red            98
4       19960331          Red            95
5       19931231          Red            96
5       19941231          Red            97
; 
run;

data ranktable;
    set ranktable;          
    format year_start date9.;
    year_start = input(put(year_start,z8.),yymmdd8.);
run;

PROC SQL

Additional fields included for your review

proc sql;
    select r.id, r.year_start, r.group_name, r.test_score, 
           put(intnx('year', r.year_start, -1, 's'), yymmdd10.) as year_ago,
           (select count(*) from ranktable sub 
            where sub.test_score >= r.test_score
            and sub.group_name = r.group_name
            and sub.year_start <= r.year_start
            and sub.year_start >= intnx('year', r.year_start, -1, 's')) as num_rank,    
           (select count(*) from ranktable sub 
            where sub.group_name = r.group_name
            and sub.year_start <= r.year_start
            and sub.year_start >= intnx('year', r.year_start, -1, 's')) as denom_rank,    
           calculated num_rank / calculated denom_rank as rank
    from ranktable r;
run;

OUTPUT

You will notice a slight difference between your expected results which may be due to the quarter day (365.25) you apply for all years as SAS's intnx takes one full calendar year in days which change with each year

Proc SQL Output