3
votes

I am translating some SAS code into R and came across the following SAS snippet:

proc means data=A noprint;
by name date; 
id comp_no;
var price; 
id rep_dats act no;
output out= test(drop=_type_ _freq_)        
median=median n=num; 
run;

I know that the 'by' statement groups the data so that the statistics are computed for each group. But what is 'id' used for, and why are there two 'id' statements? I checked the SAS help but didn't really understand it. I also looked at the examples at http://support.sas.com/documentation/cdl/en/proc/65145/HTML/default/viewer.htm#p19dfq16fqt1t3n1eroiabnn6r3s.htm, but none of them illustrates the use of ID.

As I don't have access to SAS, I can't run this and see what the output looks like. Any clarification would be a great help. Thanks!

2 Answers

6
votes

The proc means procedure calculates simple summary statistics for a data set and can display them or write them to an output data set. By default, it analyzes every numeric variable (column) in the data set.

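When an ID statement is used together with by in a proc means, the output contains one value of each ID variable per by group. By default, that value comes from the observation with the greatest value of the first variable specified in ID within the by group. Thus, if you specify several variables, e.g. id A B; the output holds the greatest value of A for that group, and B is taken from that same observation rather than being the greatest B.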

http://support.sas.com/documentation/cdl/en/proc/61895/HTML/default/viewer.htm#a000146733.htm

By the way, I don't know what your data set looks like, but it seems your proc means is summarizing only the price variable.

For example, if you have a data set:

                        Obs    sex     A      B    C     D

                         1      M      20    50    1    34
                         2      F     500    45    3    45
                         3      M     200    23    7    32
                         4      M     120    67    5    44
                         5      F     400    98    2    59

then

proc means data=sorted;
by sex;
var A B;
id D C;
output out=means(drop =_type_ _freq_);
run;

will output:

                          sex     D    C    _STAT_       A          B

                           F     59    2     N          2.000     2.0000
                           F     59    2     MIN      400.000    45.0000
                           F     59    2     MAX      500.000    98.0000
                           F     59    2     MEAN     450.000    71.5000
                           F     59    2     STD       70.711    37.4767
                           M     44    5     N          3.000     3.0000
                           M     44    5     MIN       20.000    23.0000
                           M     44    5     MAX      200.000    67.0000
                           M     44    5     MEAN     113.333    46.6667
                           M     44    5     STD       90.185    22.1886

Note that 59 is the greatest value of D in group F, whereas the C value shown (2) is not the greatest C in that group (that would be 3). Because D was specified first, C is simply the value from the same row as the greatest D. The same holds for group M, where D = 44 is the greatest value of D and C = 5 comes from that same row.
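
To make the selection rule concrete, here is a minimal sketch in plain Python (not SAS, and not an R translation) that mimics the behaviour described above on the five example rows. The function name id_values is made up for illustration. Breaking ties on the first ID variable by comparing the remaining ID variables in order is an assumption; the example data has no ties, so it does not exercise that case.

```python
from itertools import groupby

# The five example observations from the table above.
rows = [
    {"sex": "M", "A": 20,  "B": 50, "C": 1, "D": 34},
    {"sex": "F", "A": 500, "B": 45, "C": 3, "D": 45},
    {"sex": "M", "A": 200, "B": 23, "C": 7, "D": 32},
    {"sex": "M", "A": 120, "B": 67, "C": 5, "D": 44},
    {"sex": "F", "A": 400, "B": 98, "C": 2, "D": 59},
]

def id_values(rows, by, ids):
    """Hypothetical helper: for each BY group, pick the row whose ID
    variables are greatest (first ID variable compared first) and return
    the ID values of that row."""
    group_key = lambda r: tuple(r[v] for v in by)
    result = {}
    # groupby needs its input sorted by the grouping key.
    for group, members in groupby(sorted(rows, key=group_key), key=group_key):
        # Assumption: ties on the first ID variable fall through to the next.
        best = max(members, key=lambda r: tuple(r[v] for v in ids))
        result[group] = tuple(best[v] for v in ids)
    return result

print(id_values(rows, by=["sex"], ids=["D", "C"]))
# prints {('F',): (59, 2), ('M',): (44, 5)}
```

The printed pairs match the D and C columns in the output above. Under the same sketch, swapping the order to ids=["C", "D"] would instead pick the rows with the greatest C, giving (3, 45) for F and (7, 32) for M.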

1
votes

It lets you add columns to the output data set other than the columns in the class and var statements. This makes the most sense if the id variable is constant within each class combination; otherwise SAS returns the largest value within each combination of classes. See here:

http://support.sas.com/documentation/cdl/en/proc/61895/HTML/default/viewer.htm#a000146733.htm
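
If you want to check whether an id variable really is constant within each group before porting the code, something like the following plain-Python sketch could help. The rows are invented purely for illustration, the column names name, date, comp_no and price are borrowed from the question, and the function name non_constant_groups is hypothetical.

```python
from collections import defaultdict

# Invented rows; only the column names come from the question.
rows = [
    {"name": "X", "date": "2014-01-01", "comp_no": 7, "price": 10.0},
    {"name": "X", "date": "2014-01-01", "comp_no": 7, "price": 12.0},
    {"name": "Y", "date": "2014-01-01", "comp_no": 9, "price": 11.0},
    {"name": "Y", "date": "2014-01-01", "comp_no": 8, "price": 13.0},
]

def non_constant_groups(rows, by, id_var):
    """Return the groups in which id_var takes more than one distinct value."""
    seen = defaultdict(set)
    for r in rows:
        seen[tuple(r[v] for v in by)].add(r[id_var])
    return sorted(g for g, values in seen.items() if len(values) > 1)

print(non_constant_groups(rows, by=["name", "date"], id_var="comp_no"))
# prints [('Y', '2014-01-01')]
```

An empty list means the id variable acts as a plain label column, so the "largest value" rule never comes into play for that data.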