1
votes

I have a dataset with n levels of id and (n+4) variables. I wish to perform a regression for each of the n levels of the categorical variable, using the values of the n-1 variables as explanatory variables. Here is my dataset:

data have;
    input s id $ x z y00 y01 y02;
cards;
1 00  95 5.00 .02 .43 .33  
2 00 100 5.50 .01 .44 .75
3 00 110 5.25 .10 .37 .34
4 00  97 5.00 .02 .43 .33  
5 00 100 5.50 .01 .43 .75
6 00 120 5.25 .10 .38 .47
7 00  95 5.00 .02 .43 .35  
8 00 130 5.50 .01 .44 .75
9 00 110 5.25 .10 .39 .44
10 00  85 5.00 .02 .43 .33  
11 00 110 5.50 .01 .47 .78
12 00 110 5.25 .10 .37 .44
1 01 20 6.00 .22 .01 .66
2 01 25 5.95 .43 .10 .20
3 01 70 4.50 .88 .05 .17
1 02 80 2.50 .65 .33 .03
2 02 85 3.25 .55 .47 .04
3 02 90 2.75 .77 .55 .01
;
run;

So I wish to use z, y01, and y02 to explain x for ID 00. Similarly, z, y00, and y02 will explain x for ID 01. Lastly, z, y00, and y01 will explain x for ID 02.

I can just use a 'BY' statement, but I can't think of how to tell the model to ignore the variable with the same prefix as the ID I am currently working with.

I could create separate datasets, but n>100 for some of these analyses.

Ideally, I would run a proc mixed and a proc reg for every ID as described above, and have a dataset with the parameters for each.

Any ideas?

proc mixed data=have(where=(id='00')) plots(only)=all method=REML nobound ;
    class s;
    model x=z y01 y02
    / solution;
    random z y01 y02;
run;


proc reg data=have(where=(id='00'));
    model x=z y01 y02;
run;

Thanks.

1
You can solve this with a macro solution repeatedly running the analysis without creating different datasets (an on the fly WHERE and MODEL), but I'd like to see if someone knows a better solution; repeated macro solution is going to be time and disk intensiveJoe

1 Answers

2
votes

I don't know of any way to do this without data manipulation unfortunately, but here are two possible approaches that spring to mind.

Option 1. Copy the required independent variables into new variables.

/* Count the number of y variables */
proc sql noprint;
    select max(input(id, best.)) + 1
    into :dimY
    from have;
quit;

data alsoHave;
    set have;
    /* Create an array for indexing the y variables */
    array y[&dimY.] y:;
    /* Create new variables to contain y values */
    array newy[%eval(&dimY.-1)];
    _j = 1;
    do _i = 1 to &dimY.;
        /* Put each y value in a newy variable where it isn't  y_id  */
        if input(id, best.) + 1 ~= _i then do;
            newy[_j] = y[_i];
            _j + 1;
        end;
    end;
    drop _:;
run;

proc reg data = alsoHave;
    by id;
    model x = z newy:;
run;

Option 2. Reduce the variance of the unwanted variables to 0 so they won't affect the regression.

data alsoHave;
    set have;
    /* Create an array for indexing the y variables */
    array y[*] y:;
    _i = input(id, best.) + 1;
    backUp = y[_i];
    /* Overwrite the unwanted variables with 0 */
    y[_i] = 0;
    drop _:;
run;

proc reg data = alsoHave;
    by id;
    model x = z y:;
run;

I prefer the simplicity of Option 2, but array programming is fun so I included Option 1 anyway.

Edit: The below is an id agnostic version of Option 2, not requiring consecutive integers. The interesting functions are dim() which returns the number of variables in an array, and vname() which returns the variable name from an array and index. Finally compress() is used with the k (keep) and d (digits) options. Similar changes could be made to Option 1.

data alsoHave;
    set have;
    /* Create an array for indexing the y variables */
    array y[*] y:;
    /* Loop through the y variables */
    do _i = 1 to dim(y);
        /* Replace with 0  when the variable name matches the id */
        if input(compress(vname(y[_i]), , "dk"), best.) = input(id, best.) then y[_i] = 0;
    end;
    drop _:;
run;