A programming + Statistics Question:

Context: I am currently building a simulation (an agent-based model) in which each agent (read: person) has a series of variables (e.g. gender, race, marital status, income bracket, education).

This is not a homework question; it is a problem I am trying to solve for work so that I do not have to hard-code everything, which will make implementing changes to my model much easier and faster.

The variables essentially break down as follows:

gender: 0 = female, 1 = male
race:   1 = white, 2 = black, 3 = hispanic, 4 = other
marital status: 1 = married, 2 = divorced, 3 = not married
income: 1 = <20k, 2 = 20k-75k, 3 = 75k+
education:  1 = <HS, 2 = HS, 3 = >HS

In my dataset I want to predict, for example, smoking status (0 = non-smoker, 1 = smoker).
Easy, do a logistic regression. Programming in the main effects would not be too difficult since the population model would be as follows:

SmokingStatus = b_0 + b_1(gender1) + b_2(race2) + b_3(race3) + b_4(race4) + ... + e

Problem 1: As the equation above shows, a categorical variable with k levels creates k-1 dummy variables. Using race as an example, the stats program will create the dummy variables race2, race3, and race4, and each will have a beta estimate (the ln(OR) relative to the reference group, race1).

Question 1: How would I write my Java program to calculate the probability of smoking status from the regression output (the tables I have are SAS outputs) without creating the corresponding dummy variables in my agent class?

Problem 2: This problem gets even worse when I have interaction terms in my model, since the parameter estimates are then attached to combinations of each variable's dummy variables. For example, the population model above plus an interaction term between gender and race would be:

SmokingStatus = b_0 + b_1(gender1) + b_2(race2) + b_3(race3) + b_4(race4) + b_5(gender1race2) + b_6(gender1race3) + b_7(gender1race4) + ... + e

Question 2: Given this added complexity, what would be the best approach?

My ultimate goal: I am trying to write a Java program that will take in a (csv) file of variables and their parameter estimates, and essentially 'plug in the values' to generate a probability for my response variable (e.g. smoking status).
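
For illustration, here is a minimal sketch of what such a 'plug in the values' step might look like. It rests on assumptions that are not part of the SAS output: a hypothetical CSV layout with one `term,estimate` row per parameter, where a term is either `intercept`, a single `variable=level` condition, or several conditions joined by `*` for an interaction. The class and method names (`LogitScorer`, `fromCsvLines`, `logit`) are made up for the example. Reference levels simply have no row, so they contribute nothing and no dummy variables are needed in the agent class.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch: evaluate the logit from (term, estimate) rows without dummy variables. */
final class LogitScorer {
    /** One estimate plus the variable=level conditions that switch it on. */
    private static final class Term {
        final Map<String, Integer> conditions = new HashMap<>();
        final double beta;
        Term(double beta) { this.beta = beta; }
    }

    private final List<Term> terms = new ArrayList<>();

    /**
     * Parses rows in the hypothetical layout "term,estimate", for example
     * "intercept,<b0>", "race=2,<b2>" or "gender=1*race=2,<b5>".
     */
    static LogitScorer fromCsvLines(List<String> lines) {
        LogitScorer scorer = new LogitScorer();
        for (String line : lines) {
            if (line.trim().isEmpty()) continue;
            String[] cols = line.split(",");
            Term term = new Term(Double.parseDouble(cols[1].trim()));
            String name = cols[0].trim();
            // The intercept has no conditions, so it is always active.
            if (!name.equalsIgnoreCase("intercept")) {
                for (String cond : name.split("\\*")) {
                    String[] kv = cond.split("=");
                    term.conditions.put(kv[0].trim(), Integer.parseInt(kv[1].trim()));
                }
            }
            scorer.terms.add(term);
        }
        return scorer;
    }

    /** Sums the estimates of every term whose conditions all match the agent. */
    double logit(Map<String, Integer> agent) {
        double sum = 0.0;
        for (Term term : terms) {
            boolean active = true;
            for (Map.Entry<String, Integer> c : term.conditions.entrySet()) {
                if (!c.getValue().equals(agent.get(c.getKey()))) {
                    active = false;
                    break;
                }
            }
            if (active) sum += term.beta;
        }
        return sum;
    }
}
```

With this layout, an agent described by a map such as `{gender=1, race=2}` picks up the intercept, the `gender=1` and `race=2` main effects, and the `gender=1*race=2` interaction, which is exactly the set of dummies that would have been 1. The sketch assumes reference-cell coding, so the SAS parameterization should be checked before the estimates are used this way.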

Yes, I know that after I plug in all the values I will have to transform the result via:

Math.exp(logitP)/(1 + Math.exp(logitP))
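
One caveat on that expression: when `logitP` is very large, `Math.exp(logitP)` overflows to infinity and the ratio becomes `NaN`. A minimal sketch of an equivalent form that avoids this is below; the class and method names (`Logistic`, `inverseLogit`) are made up for the example.

```java
/** Sketch: logit (log-odds) to probability, written to avoid overflow. */
final class Logistic {
    static double inverseLogit(double logitP) {
        if (logitP >= 0) {
            // exp(-logitP) is at most 1 here, so it cannot overflow.
            return 1.0 / (1.0 + Math.exp(-logitP));
        }
        // For negative input, exp(logitP) is below 1, so the original form is safe.
        double e = Math.exp(logitP);
        return e / (1.0 + e);
    }
}
```

For moderate logits both forms agree, so this only matters if extreme values can occur in the simulation.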

My current (and terrible) solution involves initializing all the dummy variables to 0, then using a series of if statements to set the relevant ones to 1, then multiplying every dummy by its corresponding beta estimate (many of the terms will equal 0).

for example:

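    // Dummy variables; reference levels get no dummy.
    int race2 = 0;
    int race3 = 0;
    int race4 = 0;
    int gender1 = 0;

    // age x race interaction dummies
    int age2race2 = 0, age2race3 = 0, age2race4 = 0;
    int age3race2 = 0, age3race3 = 0, age3race4 = 0;
    int age4race2 = 0, age4race3 = 0, age4race4 = 0;
    int age5race2 = 0, age5race3 = 0, age5race4 = 0;
    int age6race2 = 0, age6race3 = 0, age6race4 = 0;

    // race (race 1 is the reference group)
    if (alcoholAgent.getRace() == 2) {race2 = 1;}
    else if (alcoholAgent.getRace() == 3) {race3 = 1;}
    else if (alcoholAgent.getRace() == 4) {race4 = 1;}

    // gender: female (0) is the reference group, so only male (1) sets a dummy
    if (alcoholAgent.getGender() == 1) {gender1 = 1;}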

    // age2-6_race2-4
    if ((alcoholAgent.getAgeCat() == 2) && (alcoholAgent.getRace()==2)) {age2race2 = 1;}
    else if ((alcoholAgent.getAgeCat() == 2) && (alcoholAgent.getRace()==3)) {age2race3 = 1;}
    else if ((alcoholAgent.getAgeCat() == 2) && (alcoholAgent.getRace()==4)) {age2race4 = 1;}

    else if ((alcoholAgent.getAgeCat() == 3) && (alcoholAgent.getRace()==2)) {age3race2 = 1;}
    else if ((alcoholAgent.getAgeCat() == 3) && (alcoholAgent.getRace()==3)) {age3race3 = 1;}
    else if ((alcoholAgent.getAgeCat() == 3) && (alcoholAgent.getRace()==4)) {age3race4 = 1;}

    else if ((alcoholAgent.getAgeCat() == 4) && (alcoholAgent.getRace()==2)) {age4race2 = 1;}
    else if ((alcoholAgent.getAgeCat() == 4) && (alcoholAgent.getRace()==3)) {age4race3 = 1;}
    else if ((alcoholAgent.getAgeCat() == 4) && (alcoholAgent.getRace()==4)) {age4race4 = 1;}

    else if ((alcoholAgent.getAgeCat() == 5) && (alcoholAgent.getRace()==2)) {age5race2 = 1;}
    else if ((alcoholAgent.getAgeCat() == 5) && (alcoholAgent.getRace()==3)) {age5race3 = 1;}
    else if ((alcoholAgent.getAgeCat() == 5) && (alcoholAgent.getRace()==4)) {age5race4 = 1;}

    else if ((alcoholAgent.getAgeCat() == 6) && (alcoholAgent.getRace()==2)) {age6race2 = 1;}
    else if ((alcoholAgent.getAgeCat() == 6) && (alcoholAgent.getRace()==3)) {age6race3 = 1;}
    else if ((alcoholAgent.getAgeCat() == 6) && (alcoholAgent.getRace()==4)) {age6race4 = 1;}

1 Answer

Any model which makes use of the numerical values of categorical variables is misleading at best. In what sense is race=2 "greater than" race=1? In no sense, of course. My advice is to dump the logistic regression.

Since there is no real ordering of the categorical variables, the best you can do is a look-up table. Just make a multidimensional table indexed by the categorical variables, and count up the examples which fall into each bin in the table to find the proportion of examples in each output category. That proportion is your probability of the output category for that combination of input variables.
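
As a rough sketch of such a table in Java, assuming a binary output and integer-coded inputs: the class name `LookupTable` and its methods are made up for the example, and a hash map keyed by the combination of levels stands in for the multidimensional array so that only occupied bins take up memory.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

/** Sketch: look-up table of outcome proportions per combination of levels. */
final class LookupTable {
    // Per bin: counts[0] = examples seen, counts[1] = examples with a positive outcome.
    private final Map<String, int[]> bins = new HashMap<>();

    private static String key(int... levels) {
        return Arrays.toString(levels);
    }

    /** Records one example, e.g. add(true, gender, race, income). */
    void add(boolean outcome, int... levels) {
        int[] counts = bins.computeIfAbsent(key(levels), k -> new int[2]);
        counts[0]++;
        if (outcome) counts[1]++;
    }

    /** Proportion of positive outcomes in the bin, or NaN if the bin is empty. */
    double probability(int... levels) {
        int[] counts = bins.get(key(levels));
        return (counts == null) ? Double.NaN : (double) counts[1] / counts[0];
    }
}
```

Empty bins return `NaN` here rather than a guess; with many variables a lot of bins will be empty or nearly so, which is the size problem described next.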

A look-up table takes all interactions of variables into account. The disadvantage is that the number of table elements may be very large. You may be able to compute the probability of the output category as a product of probabilities from smaller tables (i.e., with fewer indices per table). This is what is called a "naive Bayes" model; it makes an assumption that input variables (or groups of them) are independent given the output category.
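
A minimal sketch of the naive Bayes variant, again assuming a binary output (0 or 1) and integer-coded inputs, with one small table per input variable. The names (`NaiveBayes`, `add`, `probabilityOfOne`) are made up for the example, and no smoothing is applied, so a level never seen with a given output gets probability zero for that output.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch: naive Bayes for a binary outcome from per-variable count tables. */
final class NaiveBayes {
    private final int[] classCount = new int[2];                    // examples per outcome
    private final Map<String, int[]> levelCount = new HashMap<>();  // "variableIndex:level" -> count per outcome

    /** Records one example; outcome must be 0 or 1. */
    void add(int outcome, int... levels) {
        classCount[outcome]++;
        for (int i = 0; i < levels.length; i++) {
            levelCount.computeIfAbsent(i + ":" + levels[i], k -> new int[2])[outcome]++;
        }
    }

    /** P(outcome = 1 | levels), treating the inputs as independent given the outcome. */
    double probabilityOfOne(int... levels) {
        int total = classCount[0] + classCount[1];
        double[] score = new double[2];
        for (int y = 0; y < 2; y++) {
            score[y] = (double) classCount[y] / total;               // prior P(y)
            for (int i = 0; i < levels.length; i++) {
                int[] counts = levelCount.get(i + ":" + levels[i]);
                int n = (counts == null) ? 0 : counts[y];
                // Multiply by P(x_i = level | y); zero if the outcome was never seen.
                score[y] *= (classCount[y] == 0) ? 0.0 : (double) n / classCount[y];
            }
        }
        // Normalize over both outcomes; NaN if neither outcome is possible.
        return score[1] / (score[0] + score[1]);
    }
}
```

The storage here grows with the sum of the numbers of levels rather than their product, which is the saving over the full table; the price is the independence assumption.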