1
votes

I have voting data in the form of counts for OutcomeA and counts for OutcomeB (there are only two outcomes). I am using the formulation of the glm binomial family of models as suggested here: GLM for proportion data in r ( https://stats.stackexchange.com/questions/89734/glm-for-proportion-data-in-r ) with the y variable being:

cbind (OutcomeA, OutcomeB)

I would like to use the caret package to do some cross-validation and to handle the output for comparative purposes, as suggested here: Binomial GLM using caret train

Am I right in thinking that I can use the vote for outcome A as the 'y' variable, and the total electorate turnout (i.e. OutcomeA + OutcomeB) as the weight variable? Thanks.

(edit) The (artificial) data looks like:

OutcomeA OutcomeB   X1   X2   X3   X4
    1234     2345 0.23 0.34 0.34 0.45
    2345     2312 0.55 0.57 0.58 0.58
    3423     1234 0.45 0.88 0.69 0.12
...

OutcomeA is the number of votes in favour and OutcomeB is the number against.

I want to model the 'quantity' OutcomeA/(OutcomeA+OutcomeB) as a function of X1, X2, X3 and X4 using a binomial family model in glm, via caret.

The splitting of data into training and testing data is not the issue here.

1
I think what you are asking is how to divide your data into training sets with two possible outcomes, where one group is A and the other group is the total minus A, which in this case is simply B. You only need to use a weighting method for the training and test data if there are extreme differences in the number of observations between the two groups, and even then it may not matter if your data set is sufficiently large. If I have misunderstood your intent, rephrase the question or show more data and I will try to be more helpful. - sconfluentus
Thanks. I have expanded the question slightly. - Stephen Clark

1 Answer

-1
votes

If you want to model the ratio or percent of A, you could just use a linear regression with the percent as your outcome variable (create the percentage before feeding it into the equation). You would get a coefficient for each X variable indicating its impact on y, plus a y-intercept.

Currently your data is not binomial; that would require a binary outcome: yes or no, A or B, win or lose. Converting to a ratio or percent means it is not Poisson either, since that would need to be a single simple count.

If your goal is to predict a percent, then I would create the percent in a new column (A/(A + B)) and use the new column as the outcome, with a traditional linear regression:

mod <- lm(newPercent ~ X1 + X2 + X3 + X4, data = df)
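A minimal sketch of that workflow, assuming a data frame laid out like the one in the question. The data here are randomly generated purely for illustration, and the name `votes_df` is made up:

```r
# Made-up data with the same columns as in the question (illustration only).
set.seed(1)
n <- 30
votes_df <- data.frame(
  OutcomeA = sample(1000:4000, n, replace = TRUE),
  OutcomeB = sample(1000:4000, n, replace = TRUE),
  X1 = runif(n), X2 = runif(n), X3 = runif(n), X4 = runif(n)
)

# Create the share of votes for A as a new column before modelling.
votes_df$newPercent <- votes_df$OutcomeA / (votes_df$OutcomeA + votes_df$OutcomeB)

# Ordinary linear regression with the share as the outcome.
mod_lm <- lm(newPercent ~ X1 + X2 + X3 + X4, data = votes_df)

# One intercept plus one coefficient per X variable.
coef(mod_lm)
```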

If you have been tasked (in a class or something) with learning to use the glm with family="binomial" on this data set, then the simplest way to do it would be to use an if statement to ascertain the winner, create a new column with categories A & B to represent who won. Then use a glm as follows:

mod <- glm(winner ~ X1 + X2 + X3 + X4, data = df, family = binomial())

But it is not an appropriate model for predicting the percentage of votes for A; that would be a traditional linear model.

If you want to use the method from your first link, then you would be using:

mod <- glm(cbind(OutcomeA, OutcomeB) ~ X1 + X2 + X3 + X4, data = df, family = binomial(logit))
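On the weights part of the question: as far as I know, glm with a binomial family should give the same coefficients whether the response is the two-column matrix of counts or the proportion OutcomeA/(OutcomeA+OutcomeB) with the totals supplied as weights. Here is a sketch that checks this on simulated data. The data frame `sim`, the sample size, and the "true" coefficients are all made up for illustration:

```r
# Simulated data (illustration only): 50 areas with four predictors.
set.seed(42)
n <- 50
sim <- data.frame(X1 = runif(n), X2 = runif(n), X3 = runif(n), X4 = runif(n))
total <- sample(2000:5000, n, replace = TRUE)
p <- plogis(-0.5 + 1.0 * sim$X1 - 0.8 * sim$X2 + 0.3 * sim$X3 + 0.2 * sim$X4)
sim$OutcomeA <- rbinom(n, size = total, prob = p)
sim$OutcomeB <- total - sim$OutcomeA

# Form 1: two-column response of (votes for, votes against).
fit_counts <- glm(cbind(OutcomeA, OutcomeB) ~ X1 + X2 + X3 + X4,
                  data = sim, family = binomial(link = "logit"))

# Form 2: proportion as the response, total turnout as the weights.
sim$propA <- sim$OutcomeA / (sim$OutcomeA + sim$OutcomeB)
sim$turnout <- sim$OutcomeA + sim$OutcomeB
fit_weights <- glm(propA ~ X1 + X2 + X3 + X4,
                   data = sim, weights = turnout,
                   family = binomial(link = "logit"))

# Compare the coefficient estimates from the two formulations.
same_fit <- all.equal(coef(fit_counts), coef(fit_weights), tolerance = 1e-6)
same_fit
```

Note that in the second form the response is the proportion, not the raw count for outcome A; the counts only enter through the weights.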

If you want to use the second link and are getting that error, using caret to manage the training and test sets, then you need to convert your outcome variable to a TWO LEVEL factor: A or B.

df$newCategory <- ifelse(df$OutcomeA > df$OutcomeB, "A", "B")
df$newCategory <- as.factor(df$newCategory)
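As a quick check, here is the same conversion applied to the three example rows from the question (the data frame name `votes` is just for illustration):

```r
# The three example rows from the question.
votes <- data.frame(
  OutcomeA = c(1234, 2345, 3423),
  OutcomeB = c(2345, 2312, 1234)
)

# Label each row with the side that got more votes, as a two-level factor.
votes$newCategory <- factor(
  ifelse(votes$OutcomeA > votes$OutcomeB, "A", "B"),
  levels = c("A", "B")
)

as.character(votes$newCategory)  # "B" "A" "A"
```

Keep in mind that this keeps only which side won in each row and discards the size of the margin.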

Then use the glm in caret with train and it should be fine. If you are still having issues, post the updated code and I will try to help.