I've encountered a problem where I need to analyze the relationship between a movie's length, its price, and its sales on a video streaming platform. I have two choices for quantifying sales as my dependent variable:

  1. whether or not a user ended up buying the movie
  2. selling rate (# of people who bought the movie / # of people who watched the trailer)

If I use the selling rate, I would essentially fit a linear regression of the form selling rate = beta_0 + beta_1*length + beta_2*price + beta_3*length*price.

But if I'm asked to use option 1, where my response is a binary outcome, I assume I need to switch to logistic regression. How would the standard error change? Would it be underestimated?

You may also apply linear regression to a binary outcome, which is called a Linear Probability Model, so the fitted values can be read directly as probabilities. Anyway, purely statistical questions like this should be asked on Cross Validated. - jay.sf
You should use a GLM in both cases (I'd try a quasibinomial logistic regression first), since the response is bounded in both cases. For the second option you could alternatively use a beta regression. - SweetSpot
Even if you switch to a logistic regression model, you are still making some unwarranted assumptions. The rate of people buying a movie is unlikely to be linearly related to its length. If you find that the rate of purchase is 60% for a two-hour movie and 40% for a one-hour movie, would you happily conclude that 20% would purchase a zero-length movie? A non-linear model might be more realistic. - Allan Cameron
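
To make the last comment's arithmetic concrete, here is a small sketch in plain Python. It takes only the two numbers quoted in that comment (40% at one hour, 60% at two hours) and extrapolates to a zero-length movie twice: once with a straight line on the rate itself, and once with a straight line on the log-odds, which is the form logistic regression assumes. The variable names are made up for illustration, and nothing here is fitted to real data.

```python
import math

def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))

def inv_logit(x):
    """Map a log-odds value back to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

# The two points from the comment: (length in hours, purchase rate)
x1, p1 = 1.0, 0.40
x2, p2 = 2.0, 0.60

# Straight line on the rate scale (plain linear regression)
slope_lin = (p2 - p1) / (x2 - x1)
rate_at_zero_lin = p1 - slope_lin * x1

# Straight line on the log-odds scale (logistic regression)
slope_logit = (logit(p2) - logit(p1)) / (x2 - x1)
intercept_logit = logit(p1) - slope_logit * x1
rate_at_zero_logit = inv_logit(intercept_logit)

print(round(rate_at_zero_lin, 3))    # 0.2
print(round(rate_at_zero_logit, 3))  # 0.229
```

Both extrapolations still imply that roughly a fifth of viewers would buy a movie of length zero, which seems to be the point of the comment: switching the link function alone does not make the dependence on length realistic.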

1 Answer

Your SE will be on a different scale (log-odds rather than the rate itself), but if you have a large effect with the continuous outcome, there is a solid chance that you will reach the same inferences with the binary logistic model. The logistic model is "throwing away" nearly all the variability in the responses, so it has relatively low power. As SweetSpot said, you should treat this as a GLM problem because the range of the outcome is restricted. That is, you don't want a model that can give you negative counts or rates. The variance estimates also need care. Consider using glm with family = binomial for the yes/no sold outcome and family = poisson for the count/rate. The UCLA web pages for logistic, Poisson and negative binomial regression are a great place to start. Probably the best book for people who want clean writing without proofs is Agresti's An Introduction to Categorical Data Analysis.
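
One way to read the remark that the variance estimates need care: a selling rate computed from a handful of trailer views is much noisier than the same rate computed from thousands, whereas ordinary least squares on the rates treats every movie as equally precise. The sketch below uses the usual binomial standard error of a proportion, sqrt(p * (1 - p) / n). The function name and the two movies are hypothetical, chosen only to show the effect.

```python
import math

def rate_and_se(buyers, viewers):
    """Observed selling rate and its binomial standard error."""
    p = buyers / viewers
    se = math.sqrt(p * (1.0 - p) / viewers)
    return p, se

# Hypothetical movies with the same selling rate (10%)
# but very different numbers of trailer viewers.
small = rate_and_se(buyers=2, viewers=20)
large = rate_and_se(buyers=200, viewers=2000)

for label, (p, se) in [("20 viewers", small), ("2000 viewers", large)]:
    print(f"{label}: rate={p:.2f}, se={se:.4f}")
```

With 100 times as many viewers the standard error is 10 times smaller. A binomial GLM fitted to the buyer and non-buyer counts accounts for this automatically, while a linear regression on the raw rates would need weights to do the same.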