1
votes

I'm having some difficulty with running a logistic regression in R using glm. There are two ways to pass the binary response variable into glm to perform a logistic regression. You can pass the data to glm in a serial data format (e.g. one row per observation, the response variable either being a 0 or a 1, and the independent variable taking on whatever value you have for it), or you can pass it in as a table, with a minimum of three columns: the first column indicating the number of trials, the second column indicating the number of successes, and the third being the independent variable.

When I use glm using the latter data format (e.g. a data frame with three columns), I get the expected output, but when I enter the data using the former (i.e. serial data format) I don't get the expected answer.

Here's an example

prices <- c(89.99, 99.99, 149.99)
non_purchases <- c(11907, 2024, 5046)
purchases <- c(1369, 215, 31)
trials <- cbind(non_purchases, purchases)

model <- glm(trials ~ prices, family=binomial(link="logit"))

> summary(model)

Call:
glm(formula = trials ~ prices, family = binomial)

Deviance Residuals: 
     1       2       3  
 1.332  -4.440   1.553  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -1.923863   0.241677   -7.96 1.71e-15 ***
prices       0.044995   0.002593   17.35  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 715.832  on 2  degrees of freedom
Residual deviance:  23.897  on 1  degrees of freedom
AIC: 49.228

Number of Fisher Scoring iterations: 4

I get the expected value out in this case, however with serial data

> head(atable)
  ordered sale_price
1       0     149.99
2       0     149.99
3       0     149.99
4       0     149.99
5       0     149.99
6       0     149.99
> summary(atable)
    ordered          sale_price    
 Min.   :0.00000   Min.   : 89.99  
 1st Qu.:0.00000   1st Qu.: 89.99  
 Median :0.00000   Median : 89.99  
 Mean   :0.07843   Mean   :105.87  
 3rd Qu.:0.00000   3rd Qu.: 99.99  
 Max.   :1.00000   Max.   :149.99 

> conv_model <- glm(ordered ~ sale_price, family=binomial(link="logit"), data=atable)
> summary(conv_model)

Call:
glm(formula = ordered ~ sale_price, family = binomial(link = "logit"), 
    data = atable)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.4743  -0.4743  -0.4743  -0.1209   3.1376  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)  0.549136   0.095341    5.76 8.43e-09 ***
sale_price  -0.019949   0.001002  -19.90  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 11322  on 20591  degrees of freedom
Residual deviance: 10623  on 20590  degrees of freedom
AIC: 10627

Number of Fisher Scoring iterations: 7

And just to show that it's the same data

> table(atable$ordered, atable$sale_price)

    89.99 99.99 149.99
  0 11907  2024   5046
  1  1369   215     31

The output I get is totally different, and I'm totally confused. Can anyone help me out? I assume I'm doing something simple

1

1 Answers

3
votes

I think your issue is that you are switching the definition of "success".

From ?glm (emphasis mine)

For binomial and quasibinomial families the response can also be specified as... a two-column matrix with the columns giving the numbers of successes and failures.

So the first column is "successes". In your code, you use cbind(non_purchases, purchases), which makes non_purchases the "success" column. But in your table, non-purchases are coded as 0 for failure. With the code below, we get identical results:

prices <- c(89.99, 99.99, 149.99)
non_purchases <- c(11907, 2024, 5046)
purchases <- c(1369, 215, 31)
trials <- cbind(non_purchases, purchases)

dd = data.frame(
    price = c(rep(prices, non_purchases), rep(prices, purchases)),
    purchase = c(rep(0, sum(non_purchases)), rep(1, sum(purchases)))
)

coef(glm(purchase ~ price, data = dd, family = "binomial"))
# (Intercept)       price 
#  1.92386320 -0.04499477 

coef(glm(cbind(purchases, non_purchases) ~ prices, family = "binomial"))
# (Intercept)       price 
#  1.92386320 -0.04499477