1
votes

I am looking to run a probit model in R setting certain coefficients equal to each other.

Consider the simple example where four teams play each other once at home and once on the road:

Home <- c('NY','NY','NY','LA','LA','LA','BOS','BOS','BOS','CHI','CHI','CHI')
Away <- c('LA','CHI','BOS','NY','CHI','BOS','LA','CHI','NY','LA','NY','BOS')
HomeWin <- c(1,1,0,1,0,1,0,1,0,0,0,1)
results <- data.frame(Home,Away,HomeWin)

Suppose I want to run a probit model where I include dummy variables for the home team and the away team.

model <- glm(HomeWin ~ as.factor(Home) + as.factor(Away), family = binomial(link="probit"), data = results)

The model returns coefficient estimates for three of the home teams (relative to an excluded home team) and three of the away teams (relative to an excluded away team). Suppose I want to constrain the model so that the home coefficient for NY equals the away coefficient for NY (and likewise for the other cities). How would I do this? My full data contains 30 of these groups and significantly more variables.

2
did you mean for HomeWin to be the response variable? - Ben Bolker
Yes that's correct! I'm making the update now. - Jeremy Losak
Just to be certain I understand correctly: you want beta_home to be equal to beta_away for the equivalent factor levels? Not with inverted signs or something? - Oliver

2 Answers

3
votes

If I understand the question correctly, what you are actually looking for is for home and away to have opposite effects, e.g. beta_{home=NY} = -beta_{away=NY}, although this is not completely clear. A simple way of achieving this is to design the dummy variables manually, so that each team gets a single column, NY_home_or_away (named simply NY in the code), that is 1 when the team is at home, -1 when it is away, and 0 otherwise. The coefficient beta_NY_home_or_away is then estimated from both home and away games, and it enters with a negative sign when the team is away.

library(dplyr)

competitors <- unique(unlist(results[, c('Home', 'Away')]))
new_cols <- lapply(competitors, function(x){
  home <- results[['Home']] == x
  away <- results[['Away']] == x
  case_when(home ~ 1, 
            away ~ -1,
            TRUE ~ 0)
})
names(new_cols) <- competitors
results_wide <- bind_cols(results, new_cols)

fit <- glm(HomeWin ~ NY + LA + CHI + BOS, data = results_wide, family = binomial('probit'))
summary(fit)

Call:
glm(formula = HomeWin ~ NY + LA + CHI + BOS, family = binomial("probit"), 
    data = results_wide)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-1.64597  -0.73997   0.01633   1.19731   1.19731  

Coefficients: (1 not defined because of singularities)
              Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.927e-02  3.823e-01  -0.077    0.939
NY           6.786e-01  6.676e-01   1.017    0.309
LA           6.786e-01  6.676e-01   1.017    0.309
CHI         -2.898e-16  6.527e-01   0.000    1.000
BOS                 NA         NA      NA       NA

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 16.636  on 11  degrees of freedom
Residual deviance: 14.537  on  8  degrees of freedom
AIC: 22.537

Number of Fisher Scoring iterations: 5

Note that the sign of each team's contribution now depends on whether the team is home (+1) or away (-1). Any statistical test should be done with some care after such a transformation, as its interpretation and validity depend on the other variables in the model. Also note that one team will get NA estimates, because the dummies are linearly dependent.
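The linear dependence can be checked directly: every game has exactly one home team (+1) and one away team (-1), so the team columns sum to zero in every row. Below is a minimal base-R sketch of that check. It rebuilds the example data so it runs on its own, and the names `teams` and `X` are introduced here purely for illustration.

```r
# Example data from the question
Home <- c('NY','NY','NY','LA','LA','LA','BOS','BOS','BOS','CHI','CHI','CHI')
Away <- c('LA','CHI','BOS','NY','CHI','BOS','LA','CHI','NY','LA','NY','BOS')
teams <- unique(c(Home, Away))

# One column per team: +1 where the team is at home, -1 where it is away, 0 otherwise
X <- sapply(teams, function(tm) (Home == tm) - (Away == tm))

row_sums <- rowSums(X)   # zero in every row, so the columns are linearly dependent
rank_X <- qr(X)$rank     # one less than the number of teams
```

Because the rank is one less than the number of columns, one team has to act as the reference level. glm() does this automatically by reporting NA for one column, and the remaining coefficients are then read relative to that team.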

2
votes

You can create a dummy variable for each team that indicates whether the team appears as either Home or Away, and use those dummies in the regression.

(The example below may behave oddly numerically given the small sample data you provided, but it should work with the real data.)


library(dplyr)
library(fastDummies)

teams <- results$Home %>% unique()

# add a dummy that is 1 when the given team is either Home or Away
add_HoA <- function(df, team) {
  HoA_str <- paste0('HoA_', team)

  # `!!HoA_str :=` lets mutate() use the string as the new column name
  df %>% mutate(!!HoA_str := as.integer(Home == team | Away == team))
}

for (team in teams) {
  results <- add_HoA(results, team)
}

# using HoA_ variables for all teams  
model2 <- glm(HomeWin ~ ., family = binomial(link="probit"), 
              data = results %>% dplyr::select(HomeWin, starts_with('HoA_')))
summary(model2)

results <- fastDummies::dummy_cols(results, select_columns = c('Home','Away'))

# using HoA_ variables for NY
model3 <- glm(HomeWin ~ ., family = binomial(link="probit"), 
              data = results %>%
                dplyr::select(HomeWin, HoA_NY, starts_with('Home_'), starts_with('Away_')) %>%
                dplyr::select(-Home_NY, -Away_NY))
summary(model3)

# using HoA_ variables for BOS
model4 <- glm(HomeWin ~ ., family = binomial(link="probit"), 
              data = results %>%
                dplyr::select(HomeWin, HoA_BOS, starts_with('Home_'), starts_with('Away_')) %>%
                dplyr::select(-Home_BOS, -Away_BOS))
summary(model4)
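One thing to check before reading model2's output: each game involves exactly two teams, so the HoA_ columns add up to 2 in every row and are collinear with the intercept. The base-R sketch below verifies this on the example data. It is self-contained, and the names `teams`, `HoA`, `row_totals` and `rank_with_intercept` are introduced here for illustration only.

```r
# Example data from the question
Home <- c('NY','NY','NY','LA','LA','LA','BOS','BOS','BOS','CHI','CHI','CHI')
Away <- c('LA','CHI','BOS','NY','CHI','BOS','LA','CHI','NY','LA','NY','BOS')
teams <- unique(Home)

# One column per team: 1 if the team plays in the game (home or away), 0 otherwise
HoA <- sapply(teams, function(tm) as.integer(Home == tm | Away == tm))
colnames(HoA) <- paste0('HoA_', teams)

row_totals <- rowSums(HoA)                      # 2 in every row
rank_with_intercept <- qr(cbind(1, HoA))$rank   # smaller than the 5 columns supplied
```

With an intercept in the model, one of the HoA_ coefficients would therefore be expected to come back as NA. That reflects the coding rather than a problem with the data.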