
I am confused about the Zelig package, and in particular the function sim. I want to estimate a logistic regression on a subset of my data and then compute fitted values for the remaining data to see how well the model performs out of sample. Some sample code follows:

library(Zelig)
library(data.table)

data(turnout)
turnout <- data.table(turnout)

# Shuffle the data
turnout <- turnout[sample(.N, 2000)]

# Create a sample for regression
turnout_sample <- turnout[1:1800, ]

# Create a sample for out-of-sample testing
turnout_sample2 <- turnout[1801:2000, ]

# Run the regression
z.out1 <- zelig(vote ~ age + race, model = "logit", data = turnout_sample)

summary(z.out1)

Model:

Call:
z5$zelig(formula = vote ~ age + race, data = turnout_sample)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.9394  -1.2933   0.7049   0.7777   1.0718

Coefficients:
             Estimate  Std. Error  z value    Pr(>|z|)
(Intercept)  0.028874    0.186446    0.155    0.876927
age          0.011830    0.003251    3.639    0.000274
racewhite    0.633472    0.142994    4.430  0.00000942

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2037.5  on 1799  degrees of freedom
Residual deviance: 2002.9  on 1797  degrees of freedom
AIC: 2008.9

Number of Fisher Scoring iterations: 4

# Next step: use the setx method to set the x values to the remaining 200 observations
x.out1 <- setx(z.out1, fn = NULL, data = turnout_sample2)

# Simulate
s.out1 <- sim(z.out1, x = x.out1)

# Get the fitted values
fitted <- s.out1$getqi("ev")

What I don't understand is that the list fitted now contains 1000 values, all of them between 0.728 and 0.799.

1. Why are there 1000 values when I am trying to get fitted values for 200 observations?
2. Why are the values so closely grouped?

I hope someone can help me with this.

Best regards


1 Answer


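The first question:
From the signature of sim, sim(obj, x = NULL, x1 = NULL, y = NULL, num = 1000, ...), you can see that the default number of simulations is 1000. Note that num controls the number of simulation draws, not the number of observations. If you want 200 draws, set num = 200.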

However, sim as used in the documentation example you are following does not produce one fitted value per observation. It simulates the probability that a person will vote given one particular set of covariate values, which are either computed by setx or fixed explicitly, as in setx(z.out, race = "white").
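To make this concrete, here is a rough sketch in base R of what a simulation-based expected value looks like. It is not Zelig's actual implementation. It assumes the coefficient draws come from a multivariate normal centred on the estimates, and it uses synthetic data as a stand-in for turnout; the names train, x_profile, beta_draws and ev are made up for illustration.

```r
# Synthetic stand-in for the turnout data (hypothetical values, not the real data set)
set.seed(1)
n <- 1800
train <- data.frame(
  age  = sample(18:90, n, replace = TRUE),
  race = factor(sample(c("others", "white"), n, replace = TRUE, prob = c(0.15, 0.85)))
)
train$vote <- rbinom(n, 1, plogis(0.03 + 0.012 * train$age + 0.63 * (train$race == "white")))

fit <- glm(vote ~ age + race, family = binomial, data = train)

# A single covariate profile: intercept, mean age, race = "white"
x_profile <- c(1, mean(train$age), 1)

# Draw `num` coefficient vectors from N(beta_hat, vcov(fit)) using a Cholesky factor
num <- 1000
z <- matrix(rnorm(num * length(coef(fit))), nrow = num)
beta_draws <- sweep(z %*% chol(vcov(fit)), 2, coef(fit), "+")

# Each draw gives one simulated expected probability for the same profile
ev <- plogis(beta_draws %*% x_profile)

length(ev)  # one value per draw, not per observation
range(ev)   # narrow, because only the coefficient uncertainty varies
```

All the values describe the same hypothetical person, so their spread only reflects how precisely the coefficients are estimated.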

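So in your case, you have 1000 simulated probability values between 0.728 and 0.799, which is what you are supposed to get. They are closely grouped because they are repeated draws of the expected probability for the same covariate values, so they differ only through the uncertainty in the estimated coefficients, not through differences between your 200 observations.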
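If what you want is one fitted probability per held-out observation, one option outside Zelig is plain glm with predict. The following is a minimal sketch on synthetic data standing in for turnout; the data frame dat, the split, and the 0.5 cutoff are assumptions for illustration only.

```r
# Synthetic stand-in for the turnout data (hypothetical values, not the real data set)
set.seed(2)
n <- 2000
dat <- data.frame(
  age  = sample(18:90, n, replace = TRUE),
  race = factor(sample(c("others", "white"), n, replace = TRUE, prob = c(0.15, 0.85)))
)
dat$vote <- rbinom(n, 1, plogis(0.03 + 0.012 * dat$age + 0.63 * (dat$race == "white")))

# Rows are already in random order, so split by position
train <- dat[1:1800, ]
test  <- dat[1801:2000, ]

# Fit the logit model on the training rows only
fit <- glm(vote ~ age + race, family = binomial, data = train)

# One predicted probability for each held-out row
p_hat <- predict(fit, newdata = test, type = "response")
length(p_hat)  # 200

# Crude out-of-sample check: share classified correctly at a 0.5 cutoff
accuracy <- mean((p_hat > 0.5) == test$vote)
accuracy
```

Here type = "response" returns probabilities rather than values on the logit scale, so p_hat can be compared directly with the observed vote column.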