
I have a variable that I created from some data. For this new variable I need to calculate various statistics, but under conditions, for example: the median of the new variable only for observations whose birth country is not Italy; the mean of a different variable only when age > 35; Q1 and Q3 of the same variable for two groups (Female and Male, for example); and so on. Should I use PROC FREQ or PROC MEANS, given that it includes all these statistics? Either way, this is not working for me. How can I perform this procedure on a single variable from the data?

proc means data=dat2;
where "birth_country" NE Italy";
run;

proc means data dat2;
where Mage>=35;
run;

2 Answers

0
votes

I wouldn't create a separate data step as suggested by Alex A. That can be a bad habit to develop as, with large datasets, it can be extremely costly in terms of CPU.

Rather, I would subset the data within the Proc Means call itself, though slightly differently from Alex A's suggestion. You probably don't want to rely on a 'by' statement, since it requires the data to be sorted by the 'by' variables (another CPU cost to minimize):

proc means data=dat2(where=(country ne 'ITALY')) 
           median n mean q1 q3 noprint nway;
var NewVar;
output out=median1(drop=_:) n=n median=median mean=mean q1=q1 q3=q3;
run;
proc print data=median1;
run;

proc means data=dat2(where=(age>35)) n median mean q1 q3 noprint nway;
var DiffVar;
class sex;
output out=median2(drop=_:) n=n median=median mean=mean q1=q1 q3=q3;
run;
proc print data=median2;
run;

The 'noprint' option suppresses SAS writing the output to the listing file.

The 'nway' option limits the output dataset to the rows with the highest _TYPE_ value, that is, one row per level of sex, the class variable. As Alex A. notes, without it SAS would produce three rows for each requested metric: 2 for gender and one overall.
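To see which rows 'nway' keeps, here is a minimal sketch on a made-up dataset. The dataset name toy, its values, and the output dataset names all_types and nway_only are hypothetical and only serve the illustration; the variable names follow the example above.

```sas
/* Hypothetical toy data, for illustration only */
data toy;
    input sex $ age DiffVar;
    datalines;
F 40 10
F 50 20
M 36 30
M 60 50
M 20 99
;
run;

/* Without nway: the output dataset holds an overall row (_TYPE_=0)
   plus one row per level of sex (_TYPE_=1) */
proc means data=toy(where=(age>35)) noprint;
    class sex;
    var DiffVar;
    output out=all_types mean=mean;
run;

/* With nway: only the rows with the highest _TYPE_ value are kept,
   i.e. one row per level of sex */
proc means data=toy(where=(age>35)) noprint nway;
    class sex;
    var DiffVar;
    output out=nway_only mean=mean;
run;

/* _TYPE_ and _FREQ_ are kept here on purpose so the difference is visible */
proc print data=all_types; run;
proc print data=nway_only; run;
```

Comparing the two printed datasets shows the extra overall row that 'nway' leaves out.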

The 'drop=_:' dataset option strips out any variable whose name begins with an underscore. For Proc Means, this includes the automatic variables _TYPE_ and _FREQ_ as well as any other variable in the output dataset whose name begins with an underscore.

Adding the 'n' statistic to the Proc Means call gives you the number of nonmissing values of the analysis variable in each level of the class variable. The automatic _FREQ_ variable counts all observations in each level, including those where the analysis variable is missing, and it is dropped here by 'drop=_:' in any case.

Alternatively, you can subset the data within the Proc Means step using a 'where' statement. I'm not sure, but my impression is that subsetting with the 'where=' dataset option on 'data=' is more computationally efficient. I'm deducing this from the general SAS rule of avoiding executable statements and keeping 'if', 'where' and other commands at the level of the PDV (program data vector) as far as possible:

proc means data=dat2 median n mean q1 q3 noprint nway;
var NewVar;
where country ne 'ITALY';
output out=median1(drop=_:) n=n median=median mean=mean q1=q1 q3=q3;
run;
0
votes

You can set unwanted values to missing in the input dataset since missing values are ignored by proc means. Then you can use all desired variables in a single var statement.

Assuming your variables to be summarized are called var1 and var2, you can do it like this:

data input;
    set dat2;
    if birth_country = "Italy" then call missing(var1);
    if age <= 35 then call missing(var2);
run;

proc means data = input mean median q1 q3;
    class sex;
    var var1 var2;
run;

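Using class sex; will give you results for each sex. In the printed output Proc Means shows only the per-sex rows by default; add the printalltypes option to the proc means statement if you also want the overall row.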
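If you want the results in a dataset rather than in the listing, the same single call can write them out. This is only a sketch building on the input dataset created above; the output dataset name stats is hypothetical. The autoname option lets SAS build the output variable names from each analysis variable and statistic, so nothing has to be named by hand.

```sas
/* Assumes the dataset "input" from the data step above already exists */
proc means data = input noprint nway;
    class sex;
    var var1 var2;
    /* autoname generates names such as var1_Mean and var2_Q3 */
    output out = stats mean= median= q1= q3= / autoname;
run;

proc print data = stats;
run;
```

Because nway is specified, stats contains one row per level of sex and no overall row.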

If that seems too difficult to follow, you can just break it down into multiple calls to proc means.

proc means data = dat2 mean median q1 q3;
    class sex;
    var var1;
    where birth_country ^= "Italy";
run;

proc means data = dat2 mean median q1 q3;
    class sex;
    var var2;
    where age > 35;
run;

For more specific details on the various statements and options available in proc means, take a look at the SAS documentation for that procedure.