1
votes

I have a data set in SAS that has multiple columns that have missing data. This post replaces all the missing values in the entire data set with zeros. But since it goes through the entire data set you can't just replace the zero with the mean or median for that column. How do I replace missing data with the mean of that column?

There are only 5 or so columns so the script doesn't need to go through the entire data set.

3

3 Answers

2
votes

PROC STDIZE has an option to do just this. The REPONLY option tells it you want it to only replace missing values, and METHOD=MEAN tells it how you want to replace those values. (PROC EXPAND also could be used, if you are using time series data, but if you're just using mean, STDIZE is the simpler one.)

For example:

data missing_class;
  set sashelp.class;
  if _N_=5 then call missing(age);
  if _N_=7 then call missing(height);
  if _N_=9 then call missing(weight);
run;

proc stdize data=missing_class out=imputed_class
            method=mean reponly;
    var age height weight;
run;
0
votes

Ideally, you would want to use PROC MI to do multiple imputation and get a more accurate representation of missing values; however, if you wish to use the average, and alternate way of doing so can be done with PROC MEANS and a data step.

/* Set up data */
data have(index=(sex) );
    set sashelp.class;

    if(_N_ IN(3,7,9,12) ) then call missing(height);
run;

/* Calculate mean of all non-missing values */
proc means data=have noprint;
    by sex;
    output out=means mean(height) = imp_height;
run;

/* Merge avg. values with original data */
data want;
    merge have
          means;
    by sex;

    if(missing(height) ) then height = imp_height;

    drop imp_height;
run;
0
votes

You can use the mean function in proc sql to replace only the missing observations in each column:

data temp;
   input var1 var2 var3 var4 var5;
   datalines;
    . 2 3 4 .
    6 7 8 9 10
    . 12 . . 15
    16 17 18 19 .
    21 . 23 24 25
    ;
run;

proc sql;
    create table temp2 as select
        case when missing(var1) then mean(var1) else var1 end as var1,
        case when missing(var2) then mean(var2) else var2 end as var2,
        case when missing(var3) then mean(var3) else var3 end as var3,
        case when missing(var4) then mean(var4) else var4 end as var4,
        case when missing(var5) then mean(var5) else var5 end as var5
        from temp;
quit;

And, as Joe mentioned, you can use coalesce instead if you prefer that syntax:

coalesce(var1, mean(var1)) as var1