Multinomial logit model in R on grouped data, data conversion and mlogit set-up

Question

I want to estimate the parameters of a multinomial logit model in R and wondered how to correctly structure my data. I’m using the “mlogit” package.

The purpose is to model people's choice of transportation mode. However, the dataset is a time series at an aggregated level, e.g.:

This data must be reshaped from grouped count data to ungrouped data. My approach is to make three new rows for every individual, so I end up with a dataset looking like this:

For every individual's choice in the grouped data I make three new rows and use chid to tie these three rows together. I now want to run: mlogit.data(MyData, choice = "choice", chid.var = "chid", alt.var = "mode").

Is this the correct approach? Or have I misunderstood the purpose of the chid function?

I do understand that this was originally posted on Cross Validated, and the standards might be different there, but on SO never, never, NEVER post an image of your data. This is less than useless, and kind of annoying actually. Import the data into R and post the output of, e.g. dput(mydata). That way, it's easy for us to import and manipulate it. — jlhoward

jlhoward · Accepted Answer · 2015-09-12T21:25:11

It's too bad this was migrated from stats.stackexchange.com, because you probably would have gotten a better answer there.

The mlogit package expects data on individuals, and can accept either "wide" or "long" data. In the wide format there is one row per individual indicating the mode chosen, with a separate column for every combination of mode and mode-specific variable (time and price in your example). In the long format there are n rows for every individual, where n is the number of modes, a column containing TRUE or FALSE indicating which mode that individual chose, and one column for each mode-specific variable. Internally, mlogit uses long format datasets, but you can provide wide format and have mlogit transform it for you. In this case, with just two variables, that might be the better option.
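To make the long layout concrete, here is a minimal base-R sketch for two hypothetical individuals. The column names chid, mode and choice follow the question; the two individuals and their choices are made up for illustration, and the price and time values are the ones used in this answer's example.

```r
# Long format: one row per (individual, mode) pair.
# chid ties together the three rows that belong to one choice situation.
long <- data.frame(
  chid   = rep(1:2, each = 3),                           # individual / choice id
  mode   = rep(c("car", "bus", "bicycle"), times = 2),   # the alternative
  choice = c(TRUE, FALSE, FALSE,                         # individual 1 chose car
             FALSE, TRUE, FALSE),                        # individual 2 chose bus
  price  = rep(c(120, 60, 0), times = 2),                # mode-specific variable
  time   = rep(c(5, 10, 30), times = 2)                  # mode-specific variable
)
print(long)
```

Each chid value appears once per mode, and exactly one of its rows has choice set to TRUE. That is the structure the chid.var and alt.var arguments in the question's mlogit.data call are meant to describe.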

Since mlogit expects individuals, and you have counts of individuals, one way to deal with this is to expand your data to have the appropriate number of rows for each mode, filling out the resulting data.frame with the variable combinations. The code below does that:

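# Aggregated monthly counts per mode, and the mode-specific attributes
df.agg <- data.frame(month = 1:4, car = c(3465, 3674, 3543, 4334), bus = c(1543, 2561, 2432, 1266), bicycle = c(453, 234, 123, 524))
df.lvl <- data.frame(mode = c("car", "bus", "bicycle"), price = c(120, 60, 0), time = c(5, 10, 30))

# Expand one month of counts into one row per individual
get.mnth <- function(mnth) data.frame(mode = rep(names(df.agg)[2:4], unlist(df.agg[mnth, 2:4])), month = mnth)
df <- do.call(rbind, lapply(df.agg$month, get.mnth))
# Build wide-format column names of the form <variable>.<mode>, e.g. price.car
cols <- unlist(lapply(df.lvl$mode, function(x) paste(names(df.lvl)[2:3], x, sep = ".")))
# Pair each name with its value, in the order price, time for each mode
cols <- setNames(as.vector(apply(df.lvl[2:3], 1, c)), cols)
# Attach the same attribute values to every row
df <- data.frame(df, as.list(cols))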
head(df)
#   mode month price.car time.car price.bus time.bus price.bicycle time.bicycle
# 1  car     1       120        5        60       10             0           30
# 2  car     1       120        5        60       10             0           30
# 3  car     1       120        5        60       10             0           30
# 4  car     1       120        5        60       10             0           30
# 5  car     1       120        5        60       10             0           30
# 6  car     1       120        5        60       10             0           30

Now we can use mlogit(...)

library(mlogit)
fit <- mlogit(mode ~ price+time|0 , df, shape = "wide", varying = 3:8)
summary(fit)
#...
# Frequencies of alternatives:
#  bicycle      bus      car 
# 0.055234 0.323037 0.621729 
# 
# Coefficients :
#         Estimate Std. Error t-value  Pr(>|t|)    
# price  0.0047375  0.0003936  12.036 < 2.2e-16 ***
# time  -0.0740975  0.0024303 -30.489 < 2.2e-16 ***
# ...
coef(fit)["time"]/coef(fit)["price"]
#      time 
# -15.64069

So this suggests that reducing travel time by 1 (minute?) is worth about 15 (dollars)?
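The ratio computed above can be read against the utility specification that the formula mode ~ price+time|0 implies. The following is a sketch, assuming the usual linear-in-parameters utility for mode j; the symbols V_j and β are notation introduced here and do not appear in the mlogit output.

```latex
V_j = \beta_{\text{price}}\,\text{price}_j + \beta_{\text{time}}\,\text{time}_j,
\qquad
\frac{\beta_{\text{time}}}{\beta_{\text{price}}}
= \frac{-0.0740975}{0.0047375} \approx -15.64
```

Under this specification, the ratio is the rate at which price and time trade off at constant utility. Note that the estimated price coefficient is positive here, meaning utility rises with price, so the usual willingness-to-pay reading should be treated with caution.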

This analysis ignores the month variable. It's not clear to me how you would incorporate that, as month is neither mode-specific nor individual-specific. You could "pretend" that month is individual-specific, and use a model formula like mode ~ price+time|month, but with your dataset the system is computationally singular.

To reproduce the result from the other answer, you can use mode ~ 1|month with reflevel="car". This ignores the mode-specific variables and just estimates the effect of month (relative to mode = car).
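A self-contained sketch of that month-only model, assuming the same expanded wide-format data as in the example earlier in this answer and the same mlogit(...) call pattern (shape = "wide", varying = 3:8). The helper name modes and the object name fit.month are introduced here for illustration. Newer mlogit releases have changed how data are prepared, so the call may need adjusting depending on the installed version.

```r
library(mlogit)

# Rebuild the expanded data: one row per individual, with the
# mode-specific attributes repeated in wide-format columns.
df.agg <- data.frame(month = 1:4,
                     car = c(3465, 3674, 3543, 4334),
                     bus = c(1543, 2561, 2432, 1266),
                     bicycle = c(453, 234, 123, 524))
modes <- names(df.agg)[2:4]
df <- do.call(rbind, lapply(df.agg$month, function(m)
  data.frame(mode = rep(modes, unlist(df.agg[m, modes])), month = m)))
df <- data.frame(df,
                 price.car = 120, time.car = 5,
                 price.bus = 60, time.bus = 10,
                 price.bicycle = 0, time.bicycle = 30)

# Intercepts plus month treated as an "individual-specific" covariate,
# with car as the reference alternative; price and time are not used.
fit.month <- mlogit(mode ~ 1 | month, df, shape = "wide", varying = 3:8,
                    reflevel = "car")
summary(fit.month)
```

The summary should report an intercept and a month coefficient for each non-reference mode (bus and bicycle), each interpreted relative to car.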

There's a nice tutorial on mlogit here.
