0
votes

I want to estimate the parameters of a multinomial logit model in R and wondered how to correctly structure my data. I’m using the “mlogit” package.

The purpose is to model people's choice of transportation mode. However, the dataset is a time series on aggregated level, e.g.:

enter image description here

This data must be reshaped from grouped count data to ungrouped data. My approach is to make three new rows for every individual, so I end up with a dataset looking like this:

enter image description here

For every individual's choice in the grouped data I make three new rows and use chid to tie these three rows together. I now want to run : mlogit.data(MyData, choice = “choice”, chid.var = “chid”, alt.var = “mode”).

Is this the correct approach? Or have I misunderstood the purpose of the chid function?

2
I do understand that this was originally posted on Cross Validated, and the standards might be different there, but on SO never, never, NEVER post an image of your data. This is less than useless, and kind of annoying actually. Import the data into R and post the output of, e.g. dput(mydata). That way, it's easy for us to import and manipulate it.jlhoward

2 Answers

3
votes

It's too bad this was migrated from stats.stackexchange.com, because you probably would have gotten a better answer there.

The mlogit package expects data on individuals, and can accept either "wide" or "long" data. In the former there is one row per individual indicating the mode chosen, with separate columns for every combination for the mode-specific variables (time and price in your example). In the long format there is are n rows for every individual, where n is the number of modes, a second column containing TRUE or FALSE indicating which mode was chosen for each individual, and one additional column for each mode-specific variable. Internally, mlogit uses long format datasets, but you can provide wide format and have mlogit transform it for you. In this case, with just two variables, that might be the better option.

Since mlogit expects individuals, and you have counts of individuals, one way to deal with this is to expand your data to have the appropriate number of rows for each mode, filling out the resulting data.frame with the variable combinations. The code below does that:

df.agg <- data.frame(month=1:4,car=c(3465,3674,3543,4334),bus=c(1543,2561,2432,1266),bicycle=c(453,234,123,524))
df.lvl <- data.frame(mode=c("car","bus","bicycle"), price=c(120,60,0), time=c(5,10,30))

get.mnth <- function(mnth) data.frame(mode=rep(names(df.agg[2:4]),df.agg[mnth,2:4]),month=mnth)
df <- do.call(rbind,lapply(df.agg$month,get.mnth))
cols <- unlist(lapply(df.lvl$mode,function(x)paste(names(df.lvl)[2:3],x,sep=".")))
cols <- with(df.lvl,setNames(as.vector(apply(df.lvl[2:3],1,c)),cols))
df <- data.frame(df, as.list(cols))
head(df)
#   mode month price.car time.car price.bus time.bus price.bicycle time.bicycle
# 1  car     1       120        5        60       10             0           30
# 2  car     1       120        5        60       10             0           30
# 3  car     1       120        5        60       10             0           30
# 4  car     1       120        5        60       10             0           30
# 5  car     1       120        5        60       10             0           30
# 6  car     1       120        5        60       10             0           30

Now we can use mlogit(...)

library(mlogit)
fit <- mlogit(mode ~ price+time|0 , df, shape = "wide", varying = 3:8)
summary(fit)
#...
# Frequencies of alternatives:
#  bicycle      bus      car 
# 0.055234 0.323037 0.621729 
# 
# Coefficients :
#         Estimate Std. Error t-value  Pr(>|t|)    
# price  0.0047375  0.0003936  12.036 < 2.2e-16 ***
# time  -0.0740975  0.0024303 -30.489 < 2.2e-16 ***
# ...
coef(fit)["time"]/coef(fit)["price"]
#      time 
# -15.64069 

So this suggests the reducing travel time by 1 (minute?) is worth about 15 (dollars)?

This analysis ignores the month variable. It's not clear to me how you would incorporate that, as month is neither mode-specific nor individual specific. You could "pretend" that month is individual-specific, and use a model formula like : mode ~ price+time|month, but with your dataset the system is computationally singular.

To reproduce the result from the other answer, you can use mode ~ 1|month with reflevel="car". This ignores the mode-specific variables and just estimates the effect of month (relative to mode = car).

There's a nice tutorial on mlogit here.

0
votes

Are price and time real variables that you're trying to make a part of the model?

If not, then you don't need to "unaggregate" that data. It's perfectly fine to work with counts of the outcomes directly (even with covariates). I don't know the particulars of doing that in mlogit but with multinom, it's simple, and I imagine it's possible with mlogit:

# Assuming your original data frame is saved in "df" below
library(nnet)
response  <- as.matrix(df[,c('Car', 'Bus', 'Bicycle')])
predictor <- df$Month

# Determine how the multinomial distribution parameter estimates
# are changing as a function of time
fit <- multinom(response ~ predictor)

In the above case the counts of the outcomes are used directly with one covariate, "Month". If you don't care about covariates, you could also just use multinom(response ~ 1) but it's hard to say what you're really trying to do.

Glancing at the "TravelMode" data in the mlogit package and some examples for it though, I do believe the options you've chosen are correct if you really want to go with individual records per person.