0
votes

I have a small dataset from a survey (about 80 observations) on which I want to run a logistic regression in SAS.

My survey contains some variables (named X1, X2, X3) that I want to combine as categories of a new variable named X4.

The problem is that the variables X1-X3 already have categories of their own (YES/NO/WITHOUT OPINION).

How can I combine them as categories of X4 while keeping the values they take?

To help you understand my question:

Y(=1/0) = X1 X2 X3

X1-X3 each have 3 categories (YES/NO/WITHOUT OPINION)

What I want is:

Proc logistic data = have ; model Y = X4 and other predictors such as age, city... where X4 can take 3 values.

The problem isn't creating X4 from X1-X3, but how to assign to X4 the values that X1-X3 each take.

(NB: I say X1-X3, but there are more.)

I do this in SAS, but even a theoretical explanation would be helpful!

Thank you.

1
I don't think it would be a good idea to combine X1-X3 into a single variable. This is effectively converting all values into a single set of dummy variables where you can no longer add and test interaction effects. - Stu Sztukowski
You need to make the rules for how to combine the variables, and that will depend on the questions and whether there's a logical method. We cannot answer that for you. From a purely technical standpoint, you have options such as taking Yes if any are Yes, with No and without opinion as the secondary options. Another alternative is to take the most frequent value. - Reeza
I get your point, both. But when I don't combine them, my model overfits, since I have a small dataset and too many variables. I thought that combining some of them under a "Determinants of Y" variable would solve the problem. - iXXIX

1 Answer

2
votes

I think that the comments are right for the most part - this probably won't help your regression.

But to answer how to literally do this: the usual approach is to encode the combination using powers of 2 (or 3).

So, for typical "yes/no" where you don't care about the 3rd one, you'd assign things like this:

x4 = (x1) + (x2 * 2) + (x3 * 4);

Then the values would be like this:

0 = (0,0,0)
1 = (1,0,0)
2 = (0,1,0)
3 = (1,1,0)
4 = (0,0,1)
5 = (1,0,1)
6 = (0,1,1)
7 = (1,1,1)
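As a minimal sketch of the table above, the following DATA step builds X4 from three 0/1 variables. The dataset names `have` and `want`, the `id` column, and the toy rows are assumptions for illustration only; it also assumes X1-X3 are already numeric with 0 = NO and 1 = YES.

```sas
/* Hypothetical toy data: x1-x3 already coded 0 = NO, 1 = YES */
data have;
  input id x1 x2 x3;
  datalines;
1 0 0 0
2 1 0 0
3 1 1 0
4 0 1 1
5 1 1 1
;
run;

data want;
  set have;
  /* Each variable gets its own binary digit, so every */
  /* combination of answers maps to a unique code 0-7  */
  x4 = x1 + 2*x2 + 4*x3;
run;

proc print data=want noobs;
run;
```

Since the codes are labels rather than quantities, X4 would then need to be listed in the CLASS statement of PROC LOGISTIC so it is treated as categorical.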

If you actually want "no opinion" to be a category, you would do this with powers of 3. (This is a complicated choice: in many cases it's not ideal to include people with "no opinion" unless having an opinion is actually relevant, and it's better to exclude them or to impute the value.) It works the same way as the powers of 2; you just have a lot more category combinations (27 in total for three variables).

x4 = (x1) + (x2 * 3) + (x3 * 9);

Just make sure they're 0/1/2 coded, not 1/2/3; if they're 1/2/3, subtract one from each variable before multiplying it.
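If the answers are stored as text rather than numbers, the recoding and the base-3 combination can be done in one pass. This is only a sketch: the dataset names, the toy rows, the exact spelling of the text values, and the digit order (NO = 0, YES = 1, WITHOUT OPINION = 2) are all assumptions to adapt to the real data.

```sas
/* Hypothetical toy data: answers stored as text */
data have;
  infile datalines dlm=',';
  length x1 x2 x3 $15;
  input id x1 $ x2 $ x3 $;
  datalines;
1,NO,NO,NO
2,YES,NO,WITHOUT OPINION
3,WITHOUT OPINION,YES,YES
;
run;

data want;
  set have;
  array ans{3} x1-x3;   /* widen the range if there are more variables */
  x4 = 0;
  do i = 1 to dim(ans);
    /* Assumed mapping of text to a base-3 digit */
    select (upcase(strip(ans{i})));
      when ('NO')              digit = 0;
      when ('YES')             digit = 1;
      when ('WITHOUT OPINION') digit = 2;
      otherwise                digit = .;   /* unknown text makes x4 missing */
    end;
    /* Position i contributes digit * 3^(i-1): 1, 3, 9, ... */
    x4 = x4 + digit * 3**(i-1);
  end;
  drop i digit;
run;
```

With these rows, x4 comes out as 0, 19 (1 + 0*3 + 2*9) and 14 (2 + 1*3 + 1*9). Using an array keeps the formula the same when more than three variables are combined, although the number of possible codes triples with each one added.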


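What else can you do that's better? In theory you can do a number of things that are superior to this literal categorization, which really doesn't help your overfitting at all: a single variable with up to 27 levels needs up to 26 dummy parameters, compared with 6 for three separate 3-level variables.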

One term that's helpful is "collapsing"; see for example this paper by Bruce Lund et al. (Plug: Bruce is giving a (not free) class in regression for WUSS later this month.) You can use ANOVA to analyze which variables contribute to your model. You can use some other procedures like GLMSELECT as well; this is a major topic in regression in general.
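As a rough illustration of the GLMSELECT suggestion, a call might look like the sketch below. It assumes a dataset `have` containing Y (0/1), X1-X3, age and city as described in the question; the selection method and criterion are arbitrary choices, not a recommendation. Note that GLMSELECT fits a linear model, so Y is treated as continuous here and the result is only a screening step, not the final logistic model.

```sas
/* Sketch only: variable names taken from the question, */
/* selection settings chosen for illustration           */
proc glmselect data=have;
  class x1 x2 x3 city;
  model y = x1 x2 x3 age city / selection=stepwise(select=sbc);
run;
```

PROC LOGISTIC also has its own SELECTION= option on the MODEL statement, which may be a more direct fit for a binary outcome.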

You could also look into factor analysis, like in this SAS Book excerpt.