Re-uploading since there was some problems with my last post, and I did not know that we were supposed to post sample data. I'm fairly new to SAS, and I have a problem that I know how to solve in Excel but not SAS. however, the dataset is too large to reasonably use in Excel.
I have four variables: id, year_start, groupname, test_score.
Sample data:
id year_start group_name test_score
1 19931231 Red 90
1 19941230 Red 89
1 19951231 Red 91
1 19961231 Red 92
2 19930630 Red 85
2 19940629 Red 87
2 19950630 Red 95
3 19950931 Blue 90
3 19960931 Blue 90
4 19930331 Red 95
4 19940331 Red 97
4 19950330 Red 98
4 19960331 Red 95
5 19931231 Red 96
5 19941231 Red 97
My goal is to achieve a ranked list (fractional) by test_score for each year. I hoped that I would be able to achieve this using PROC RANK FRACTION. This function would calculate order by a test_score (highest is 1, 2nd highest is 2 and so on) and then divide by the total number of observations to provide a fractional rank. Unfortunately, year_start differs widely from row to row. For each id/year combo, I want to perform a one-year look-back from year-start, and rank that observation compared to all other id's that have a year_start in that one year range. I'm not interested in comparing by calendar year, and the rank of each id should be relative to its own year_start. Adding another level of complication, I would like this rank to be performed by groupname.
PROC SQL is totally fine if someone has a SQL solution.
Using the above data, the ranks would be like this:
id year_start group_name test_score rank
1 19931231 Red 90 0.75
1 19941230 Red 89 0.8
1 19951231 Red 91 1
1 19961231 Red 92 1
2 19930630 Red 85 1
2 19940629 Red 87 0.8
2 19950630 Red 95 0.75
3 19950931 Blue 90 1
3 19960931 Blue 90 1
4 19930331 Red 95 1
4 19940331 Red 97 0.2
4 19950330 Red 98 0.2
4 19960331 Red 95 0.333
5 19931231 Red 96 0.25
5 19941231 Red 97 0.667
In order to calculate the rank for row 1,
- we first exclude blue observations.
- Then we count the number of observations that fall within a year before that year_start, 19931231 (so we have 4 observations).
- We count how many of these observations have a higher test_score, and then add 1 to find the order of the current observation (So it is the 3rd highest).
- Then, we divide the order by the total number to get the rank (3/4= 0.75).
In Excel, the formula for this variable would look something like this. Assume formula is for row 1 and there are 100 rows. id=A, year_start=B, groupname=C, and test_score=D:
=(1+countifs(D1:D100,">"&D1,
B1:B100,"<="&B1,
B1:B100,">"&B1-365.25,
C1:C100, C1))/
countifs(B1:B100,"<="&B1,
B1:B100,">"&B1-365.25,
C1:C100, C1)
Thanks so much for the help!
ahammond428