I am very confused about the package Zelig and in particular the function sim. What i want to do is estimate a logistic regression using a subset of my data and then estimate the fitted values of the remaining data to see how well the estimation performs. Some sample code follows:
data(turnout)
turnout <- data.table(turnout)
Shuffle the data
turnout <- turnout[sample(.N,2000)]
Create a sample for regression
turnout_sample <- turnout[1:1800,]
Create a sample for out of data testing
turnout_sample2 <- turnout[1801:2000,]
Run the regression
z.out1 <- zelig(vote ~ age + race, model = "logit", data = turnout_sample)
summary(z.out1) Model:
Call: z5$zelig(formula = vote ~ age + race, data = turnout_sample)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.9394 -1.2933 0.7049 0.7777 1.0718
Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) 0.028874 0.186446 0.155 0.876927 age 0.011830 0.003251 3.639 0.000274 racewhite 0.633472 0.142994 4.430 0.00000942
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 2037.5 on 1799 degrees of freedom
Residual deviance: 2002.9 on 1797 degrees of freedom AIC: 2008.9
Number of Fisher Scoring iterations: 4
Next step: Use 'setx' method
Set the x values to the remaining 200 observations
x.out1 <- setx(z.out1,fn=NULL,data=turnout_sample2)
Simulate
s.out1 <- sim(z.out1,x=x.out1)
Get the fitted values
fitted <- s.out1$getqi("ev")
What i don't understand is that the list fitted now contains 1000 values and all the values are between 0,728 and 0,799. 1. Why are there 1000 values when what I am trying to estimate is the fitted value of 200 observations? 2. And why are the observations so closely grouped?
I hope someone can help me with this.
Best regards