1
votes

Problem:

I have a dataset with hundreds of variables (columns) and I want to standardize all numeric variables. But instead of center and dividing by just one standard deviation, I need to center and divide all variables by two standard deviations.

This is an example of the dataset I have

data have;
INPUT year $1-4 program_id $6-8 program_name $10-31 enrollments 33-36 admissions 38-41 graduates 43-46;
datalines;
2010 002 Electrical Engineering 1563 0321 0156
2010 001 Civil Engineering      2356 0739 0236
2010 003 Mechanical Engineering 0982 0234 0069
2010 021 English                3945 1034 0269
2010 031 Physics                0459 0134 0069
2010 041 Arts                   0234 0072 0045
2019 004 Engineering            4745 1202 0597
2019 022 English Teaching       2788 0887 0201
2019 023 English and Spanish    0751 0345 0092
2019 031 Physics                0589 0126 0039
2019 032 Astronomy              0093 0035 0021
2019 041 Arts                   0359 0097 0062
2019 044 Cinema                 0293 0100 0039
;
run;

I want two different datasets. In the first, standardization applies for all variables across the whole dataset.

proc sql;
create table want1 as
select *, 
(enrollments - mean(enrollments))/(2*STD(enrollments)) as z_enrollments,
(admissions - mean(admissions))/(2*STD(admissions)) as z_admissions,
(graduates - mean(graduates))/(2*STD(graduates)) as z_graduates
from have;
quit;

In the second, standardization is grouped by year:

proc sql;
create table want2 as
select *, 
(enrollments - mean(enrollments))/(2*STD(enrollments)) as z_enrollments,
(admissions - mean(admissions))/(2*STD(admissions)) as z_admissions,
(graduates - mean(graduates))/(2*STD(graduates)) as z_graduates
from have
group by year;
quit;

Question: How to do this for all the hundreds of numeric variables of my dataset, without needing to write down the name of each one of them?

What I tried:

As I want this code to be replicable to different datasets, I was trying to follow the reasoning of this other question. That is, first to identify all numeric variables, than to save all variables names into an array and them doing the computations. I thought that perhaps I also need to save the resulting parameters of each column (mean and std) in an array as well. But I still did not get how to make arrays, datasteps and loops to work together.

I started trying to set an array for calculating the number of numerical variables. This runs fine.

data _null_;
set have;
array x[*] _numeric_;
call symput("nVar",dim(x));
stop;
run;

%put Number Variables = &nVar;

Then I tried to adapt the following code - which is a combination of @DomPazz answer with @Tom suggestion in the comments - but it did not work:

data want;
set have nobs=nobs;
array x[&nVar] _numeric_;
array N[&nVar];

n(1)=x(1); do i=2 to dim(n); n(i)=(x(i) - mean(x(i))/(2*(STD(x(i)); end;

keep N:;
run;

I don't know if the above code would get the right result. But I get an error saying that I have the incorrect number of arguments for the STD function. I looked it up: apparently, STD() in datastep runs row-wise, not column-wise.

I also tried PROC STANDARD, I get some results, but they don't match with my calculations. Probably I did not set the parameters right:

proc standard data=have mean=0 std=2
 out=want;       
 run;
2
What procs have you tried?Tom
@Tom, I added some of the codes I tried in my question. I didn't do it at the first itme because I didn't want to make it TL;DR.LuizZ

2 Answers

4
votes

You can use the METHED=STD on PROC STDIZE to standardize around the mean and one STD. So just add the MULT= option to divide by 2.

proc stdize data=have method=STD mult=0.5 out=want;
run;
1
votes

Answering last comment:

@Tom I was reading the proc stdize documentation, but I could not figure out if I can customize the LOCATION and SCALE measures. For example, if instead of dividing by 2sdt, I want to subtract the mean and divide by the range for all variables. Would it be possible?

Quick solution:

* Output Mean;
proc stdize data=have method=mean out=out1 outstat=mean1;
var _numeric_;
run;

* Output Range;
proc stdize data=have method=range out=out1 outstat=range1;
var _numeric_;
run;

* LOCATION and SCALE;
data scale_location;
set mean1 (where=(_type_='LOCATION')) range1 (where=(_type_='SCALE'));
run;

* Target; 
proc stdize data=have method=in(scale_location) out=want;
var _numeric_;
run;