I am estimating a fairly simple McFadden choice model using a very large data set (101.6 million unit-alternatives). I can estimate this model just fine in Stata using the asclogit command, but when I try to use the mlogit package in R, I get the following error:

region1 <- mlogit(chosen ~ mean_log.wage + mean_log.rent + bornNear + Dim.1 + regionFE | 0,
                  shape= "long", chid.var = "chid", alt.var = "alternatives", data = ready)

Error in qr.default(na.omit(X)) : too large a matrix for LINPACK
Calls: mlogit ... model.matrix -> model.matrix.mFormula -> qr -> qr.default

If I look at the source code of qr.R, it's clear that the number of elements in my design matrix (rows times columns) exceeds the LINPACK limit of 2,147,483,647, the largest 32-bit signed integer. As far as I can tell, no such limit applies on the LAPACK code path.
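
As a rough sanity check on that limit, the arithmetic can be sketched as below. The row count comes from the post; the number of design-matrix columns is not stated there, so `max_cols` and the helper `exceeds_linpack()` are illustrative names only, and the column counts tried are hypothetical.

```r
# Rows in the long-format data, as stated in the question
n_rows <- 101.6e6

# The bound hard-coded in qr.default is the largest 32-bit signed integer
linpack_limit <- .Machine$integer.max

# Largest number of columns that still fits under the limit at this row count
max_cols <- floor(linpack_limit / n_rows)
max_cols

# Same comparison that qr.default performs before calling LINPACK
exceeds_linpack <- function(n, p) 1.0 * n * p > linpack_limit

exceeds_linpack(n_rows, 21)  # FALSE: 21 columns still fit
exceeds_linpack(n_rows, 22)  # TRUE: 22 columns trigger the error
```

So at this row count, any design matrix with more than 21 columns would trip the check, which a set of region fixed effects could plausibly do.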

From qr.R:

qr.default <- function(x, tol = 1e-07, LAPACK = FALSE, ...)
{
    x <- as.matrix(x)
    if(is.complex(x))
        return(structure(.Internal(La_qr_cmplx(x)), class = "qr"))
    ## otherwise :
    if(LAPACK)
        return(structure(.Internal(La_qr(x)), useLAPACK = TRUE, class = "qr"))
    ## else "Linpack" case:
    p <- as.integer(ncol(x))
    if(is.na(p)) stop("invalid ncol(x)")
    n <- as.integer(nrow(x))
    if(is.na(n)) stop("invalid nrow(x)")
    if(1.0 * n * p > 2147483647) stop("too large a matrix for LINPACK")
    ...

qr() appears to be called in model.matrix.mFormula, mlogit's model.matrix method for mFormula objects, while the design matrix is being built, probably as part of checking for NAs. But I can't tell whether there is a way to pass LAPACK = TRUE through mlogit, or a way to skip the NA checking.

I'm hoping @YvesCroissant will see this.

As I mentioned, I can estimate this model just fine in Stata, so it's not a question of resources. My Stata license is not portable, however, which is why I would like to use R.

I suppose you meant passing LAPACK = TRUE. It's a little hard to help without being able to reproduce the issue (could we simply generate lots of data for that?). A couple of things: 1) you may want to look into the RStata package, which lets you get Stata's output into R; 2) you could define your own function, identical to mlogit:::model.matrix.mFormula except for the qr(na.omit(X)) call at the end, where you could add LAPACK = TRUE, and then assignInNamespace("model.matrix.mFormula", myfun, ns = "mlogit") should override it. – Julius Vainora
@JuliusVainora I had also thought of hard-coding LAPACK = TRUE in the appropriate place, but wasn't sure where to start with that. Your comment is helpful! I'm not sure it would be useful to generate lots of data to reproduce this problem, since not many people have access to the 50+ GB of RAM it would take. I may come up with something and edit my post, though. – Tyler R.

1 Answer

Thanks to Julius' comment and this post on namespaces in R, I figured out the answer. I added the following code right after my library statements:

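# Load the locally edited copy of the method; this defines mymFormula
source("mymFormula.R")
# Fetch the package's original method so its environment and attributes can be reused
tmpfun <- get("model.matrix.mFormula", envir = asNamespace("mlogit"))
environment(mymFormula) <- environment(tmpfun)  # lets the copy find mlogit's internal functions
attributes(mymFormula) <- attributes(tmpfun)  # don't know if this is really needed
# Replace the method inside the mlogit namespace
assignInNamespace("model.matrix.mFormula", mymFormula, ns = "mlogit")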
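
To illustrate why the environment(...) <- line is there, here is a toy sketch that does not touch mlogit at all. `pkg_env`, `helper`, `original`, and `replacement` are made-up names standing in for a package namespace, an unexported helper, the package's method, and the edited copy.

```r
# A stand-in for a package namespace with an internal (unexported) helper
pkg_env <- new.env()
local({
  helper   <- function(x) x * 2
  original <- function(x) helper(x) + 1
}, envir = pkg_env)

# An edited copy defined outside the "package": it cannot see helper()
replacement <- function(x) helper(x) + 100
before <- tryCatch(replacement(1), error = function(e) "helper not found")

# Re-home the copy in the original function's environment, as in the answer
environment(replacement) <- environment(pkg_env$original)
after <- replacement(1)

before  # "helper not found"
after   # 102
```

Without that step, a copied method that calls any of the package's internal functions would presumably fail in the same way.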

mymFormula.R is an R script into which I copied the contents of mlogit:::model.matrix.mFormula, adding mymFormula <- before the function definition at the top of the file and, following Julius' suggestion, LAPACK = TRUE to the qr(na.omit(X)) call.
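
As a possible alternative to copying and pasting by hand, the starting file could be generated programmatically. This is only a sketch: it uses stats' model.matrix method for lm objects as a stand-in so that it runs without mlogit installed, and `my_copy` and the temporary file are illustrative names. For the case in this answer one would presumably fetch the mFormula method instead and then edit the qr() call in the written file.

```r
# Fetch an S3 method by generic and class (stand-in for the mlogit method)
f <- getS3method("model.matrix", "lm")

# Turn the function back into source text and prefix an assignment
src <- deparse(f)
src[1] <- paste("my_copy <-", src[1])

# Write the script, which can then be edited by hand before sourcing
path <- tempfile(fileext = ".R")
writeLines(src, path)

# Sourcing the file defines my_copy in the current environment
source(path, local = TRUE)
is.function(my_copy)
```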

I viewed the contents of mlogit:::model.matrix.mFormula by typing trace(mlogit:::model.matrix.mFormula, edit=TRUE) in RStudio. (Thanks to this answer for help on how to do that.)
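
One caveat that may be worth checking before relying on this: according to the documentation for qr(), the reported rank is always full in the LAPACK case. If mlogit uses the rank from this decomposition to detect collinear columns (an assumption about its internals, not something verified here), that check would no longer catch them. A small sketch with a deliberately rank-deficient matrix:

```r
# Third column is exactly twice the second, so the true rank is 2
X <- cbind(1, 1:5, 2 * (1:5))

# Default LINPACK path detects the deficiency
rank_linpack <- qr(X)$rank

# LAPACK path reports full column rank regardless
rank_lapack <- qr(X, LAPACK = TRUE)$rank

rank_linpack  # 2
rank_lapack   # 3
```

If that matters for the model at hand, collinear regressors would need to be removed before calling mlogit.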