5 votes

This Q & A arises from How to make group_by and lm fast?, where the OP was trying to do a simple linear regression per group for a large data frame.

In theory, a series of group-by regressions y ~ x | g is equivalent to a single pooled regression y ~ x * g. The latter is very appealing because statistical tests between different groups are straightforward. But in practice this larger regression is not computationally easy. My answer on the linked Q & A reviews speedlm (from package speedglm) and glm4 (from package MatrixModels), but points out that they can't address this problem well.
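
A rough sketch of the two formulations on made-up data (for illustration only; the 0 + g + x:g parameterization gives per-group intercepts and slopes directly, and is the same parameterization as the glm4 call at the end of the answer below):

## pooled regression with group interactions vs. one regression per group
## (made-up data, for illustration only)
d <- data.frame(y = rnorm(10), x = rep(1:5, 2), g = gl(2, 5))
fit_pooled <- lm(y ~ 0 + g + x:g, data = d)                      ## one big regression
fit_split  <- lapply(split(d, d$g), function(di) lm(y ~ x, di))  ## group-by fits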

A large regression problem is difficult, particularly when there are factor variables. This may explain why many people abandon this idea and prefer splitting the data by group and fitting a model per group. There is no point in my enumerating methods for group-by regression (see Linear Regression and group by in R). What I care about is speed.

For a simple linear regression like y ~ x | g, splitting the data by group and then relying on standard model fitting routines like lm is a performance killer. First of all, subsetting a large data frame is inefficient. Secondly, standard model fitting routines follow the procedure below, which is sheer overhead compared with the useful regression computation (a sketch follows the list).

  1. parse the model formula into a "terms" object (using terms.formula);
  2. construct the model frame (using model.frame.default);
  3. build the model matrix (using model.matrix.default).
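
On a hypothetical two-column data frame d, these three steps look like this (not the answer's code); lm repeats essentially this machinery for every single group:

## the bookkeeping lm performs before any actual fitting (hypothetical data frame `d`)
d  <- data.frame(y = rnorm(5), x = 1:5)
tf <- terms.formula(y ~ x)                ## 1. parse the formula into a "terms" object
mf <- model.frame.default(tf, data = d)   ## 2. construct the model frame
X  <- model.matrix.default(tf, mf)        ## 3. build the model matrix
b  <- lm.fit(X, mf$y)$coefficients        ## only now does the real computation start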

There are clever computational tricks for simple linear regression. As I demonstrated in Fast pairwise simple linear regression between variables in a data frame, the covariance method is extremely fast. Can we adapt it to group-by simple linear regression via a group_by_simpleLM function?
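
For a single group, the covariance method boils down to two closed-form expressions; a minimal sketch (not the answer's code):

## covariance method for one group: slope = cov(x, y) / var(x),
## intercept = mean(y) - slope * mean(x)  (minimal sketch, not the answer's code)
simpleLM_cov <- function (x, y) {
  beta  <- cov(x, y) / var(x)
  alpha <- mean(y) - beta * mean(x)
  c(alpha = alpha, beta = beta)   ## agrees with coef(lm(y ~ x))
  }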

1  If you want to do many simple linear regressions (possibly with some common covariables), you might want to consider using privefl.github.io/bigstatsr/reference/big_univLinReg.html – F. Privé

1 Answer

6 votes

We have to do this by writing compiled code. I will present it with Rcpp. Be aware that I am a C programmer and have been using R's conventional C interface; Rcpp is used just to ease handling of lists, strings and attributes, as well as to facilitate immediate testing in R. The code is therefore largely written in C style, and macros from R's conventional C interface like REAL and INTEGER are still used. See the bottom of this answer for "group_by_simpleLM.cpp".

The R wrapper function group_by_simpleLM has four arguments:

group_by_simpleLM <- function (dat, LHS, RHS, group) {
##.... [TRUNCATED]

  • dat is a data frame. If you feed it a matrix or a list, it will stop and complain.
  • LHS is a character vector giving the names for the variables on the left hand side of ~. Multiple LHS variables are supported.
  • RHS is a character vector giving the names for the variables on the right hand side of ~. Only a single, non-factor RHS variable is allowed in simple linear regression. You can provide a vector of variables to RHS, but the function will only retain the first element (with a warning). If that variable is not found in dat (maybe because you've mistyped the name) or it is not a numerical variable, it will give you an informative error message.
  • group is a character vector giving the name of the grouping variable. Ideally it is already a factor in dat; otherwise the function uses match(group, unique(group)) for a fast coercion (illustrated right after this list) and issues a warning. A factor with unused levels does no harm: group_by_simpleLM_cpp sees this and returns all NaN for such levels. group can be NULL, in which case a single regression is done on all the data.
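
The fast coercion mentioned above is just integer matching against the unique values; a tiny example:

## fast coercion of a non-factor grouping variable, as described above
g          <- c("b", "a", "b", "c")
grp_levels <- unique(g)             ## "b" "a" "c"  (levels in order of appearance)
grp        <- match(g, grp_levels)  ## 1 2 1 3  (integer codes passed to the C++ code)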

The workhorse function group_by_simpleLM_cpp returns a named list of matrices to hold regression results for each response. Each matrix is "wide" with nlevels(group) columns and 5 rows:

  • "alpha" (for intercept);
  • "beta" (for slope);
  • "beta.se" (for standard error of slope);
  • "r2" (for R-squared);
  • "df.resid" (for residual degree of freedom);

For a simple linear regression, these five statistics are sufficient for deriving other common statistics, like the t-statistic and p-value of the slope.
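
A hypothetical helper (not part of the answer's code) that recovers the slope's t-statistic and p-value from one of the returned matrices:

## hypothetical helper, not in the answer: slope t-statistic and p-value
## computed from a result matrix `mat` returned by group_by_simpleLM
slope_inference <- function (mat) {
  t.value <- mat["beta", ] / mat["beta.se", ]
  p.value <- 2 * pt(abs(t.value), df = mat["df.resid", ], lower.tail = FALSE)
  rbind(t.value, p.value)
  }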

The function watches out for the rank-deficient case where there is only one datum in a group. The slope cannot be estimated then and NaN is returned. Another special case is when a group has only two data points: the fit is then perfect and you get a 0 standard error for the slope.

The function is a fast method for nlme::lmList(LHS ~ RHS | group, dat, pool = FALSE) when group != NULL, and a fast method for lm(LHS ~ RHS, dat) when group = NULL (it may even be faster than general_paired_simpleLM, because it is written in C).

Caution:

  • weighted regression is not handled, because the covariance method is invalid in that case.
  • no check for NA / NaN / Inf / -Inf in dat is made and the functions break in their presence (one possible pre-filter is sketched right after this list).
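
If your data might contain such values, one option (not built into the function) is to drop the offending rows beforehand; a minimal sketch, with hypothetical column names standing in for your LHS / RHS variables:

## possible pre-filter, not part of the answer's code: keep only finite rows
## ("y1", "y2", "x" are hypothetical column names; substitute your own)
vars <- c("y1", "y2", "x")
dat  <- dat[Reduce(`&`, lapply(dat[vars], is.finite)), ]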

Examples

library(Rcpp)
sourceCpp("group_by_simpleLM.cpp")

## a toy dataset
set.seed(0)
dat <- data.frame(y1 = rnorm(10), y2 = rnorm(10), x = 1:5,
                  f = gl(2, 5, labels = letters[1:2]),
                  g = sample(gl(2, 5, labels = LETTERS[1:2])))

group-by regression: a fast method of nlme::lmList

group_by_simpleLM(dat, c("y1", "y2"), "x", "f")
#$y1
#                    a          b
#alpha     0.820107094 -2.7164723
#beta     -0.009796302  0.8812007
#beta.se   0.266690568  0.2090644
#r2        0.000449565  0.8555330
#df.resid  3.000000000  3.0000000
#
#$y2
#                  a           b
#alpha     0.1304709  0.06996587
#beta     -0.1616069 -0.14685953
#beta.se   0.2465047  0.24815024
#r2        0.1253142  0.10454374
#df.resid  3.0000000  3.00000000

fit <- nlme::lmList(cbind(y1, y2) ~ x | f, data = dat, pool = FALSE)

## results for level "a"; use `fit[[2]]` to see results for level "b"
lapply(summary(fit[[1]]), "[", c("coefficients", "r.squared"))
#$`Response y1`
#$`Response y1`$coefficients
#                Estimate Std. Error     t value  Pr(>|t|)
#(Intercept)  0.820107094  0.8845125  0.92718537 0.4222195
#x           -0.009796302  0.2666906 -0.03673284 0.9730056
#
#$`Response y1`$r.squared
#[1] 0.000449565
#
#
#$`Response y2`
#$`Response y2`$coefficients
#              Estimate Std. Error    t value  Pr(>|t|)
#(Intercept)  0.1304709  0.8175638  0.1595850 0.8833471
#x           -0.1616069  0.2465047 -0.6555936 0.5588755
#
#$`Response y2`$r.squared
#[1] 0.1253142

dealing with rank-deficiency without breaking down

## with unused level "b"
group_by_simpleLM(dat[1:5, ], "y1", "x", "f")
#$y1
#                    a   b
#alpha     0.820107094 NaN
#beta     -0.009796302 NaN
#beta.se   0.266690568 NaN
#r2        0.000449565 NaN
#df.resid  3.000000000 NaN

## rank-deficient case for level "b"
group_by_simpleLM(dat[1:6, ], "y1", "x", "f")
#$y1
#                    a        b
#alpha     0.820107094 -1.53995
#beta     -0.009796302      NaN
#beta.se   0.266690568      NaN
#r2        0.000449565      NaN
#df.resid  3.000000000  0.00000

more than one grouping variable

When we have more than one grouping variable, group_by_simpleLM cannot handle them directly. But you can use interaction to create a single factor variable first.

dat$fg <- with(dat, interaction(f, g, drop = TRUE, sep = ":"))
group_by_simpleLM(dat, c("y1", "y2"), "x", "fg")
#$y1
#                a:A        b:A        a:B        b:B
#alpha     1.4750325 -2.7684583 -1.6393289 -1.8513669
#beta     -0.2120782  0.9861509  0.7993313  0.4613999
#beta.se   0.0000000  0.2098876  0.4946167  0.0000000
#r2        1.0000000  0.9566642  0.7231188  1.0000000
#df.resid  0.0000000  1.0000000  1.0000000  0.0000000
#
#$y2
#                a:A         b:A        a:B        b:B
#alpha     1.0292956 -0.22746944 -1.5096975 0.06876360
#beta     -0.2657021 -0.20650690  0.2547738 0.09172993
#beta.se   0.0000000  0.01945569  0.3483856 0.00000000
#r2        1.0000000  0.99120195  0.3484482 1.00000000
#df.resid  0.0000000  1.00000000  1.0000000 0.00000000

fit <- nlme::lmList(cbind(y1, y2) ~ x | fg, data = dat, pool = FALSE)

## note that the first group a:A only has two values, so df.resid = 0
## my method returns 0 standard error for the slope
## but lm or lmList would return NaN
lapply(summary(fit[[1]]), "[", c("coefficients", "r.squared"))
#$`Response y1`
#$`Response y1`$coefficients
#              Estimate Std. Error t value Pr(>|t|)
#(Intercept)  1.4750325        NaN     NaN      NaN
#x           -0.2120782        NaN     NaN      NaN
#
#$`Response y1`$r.squared
#[1] 1
#
#
#$`Response y2`
#$`Response y2`$coefficients
#              Estimate Std. Error t value Pr(>|t|)
#(Intercept)  1.0292956        NaN     NaN      NaN
#x           -0.2657021        NaN     NaN      NaN
#
#$`Response y2`$r.squared
#[1] 1

no grouping: a fast method of lm

group_by_simpleLM(dat, c("y1", "y2"), "x", NULL)
#$y1
#                ALL
#alpha    -0.9481826
#beta      0.4357022
#beta.se   0.2408162
#r2        0.2903691
#df.resid  8.0000000
#
#$y2
#                ALL
#alpha     0.1002184
#beta     -0.1542332
#beta.se   0.1514935
#r2        0.1147012
#df.resid  8.0000000

fast large simple linear regression

set.seed(0L)
nSubj <- 200e3
nr <- 1e6
DF <- data.frame(subject = gl(nSubj, 5),
                 day = 3:7,
                 y1 = runif(nr), 
                 y2 = rpois(nr, 3), 
                 y3 = rnorm(nr), 
                 y4 = rnorm(nr, 1, 5))

system.time(group_by_simpleLM(DF, paste0("y", 1:4), "day", "subject"))
#   user  system elapsed 
#  0.200   0.016   0.219 

library(MatrixModels)
system.time(glm4(y1 ~ 0 + subject + day:subject, data = DF, sparse = TRUE))
#   user  system elapsed 
#  9.012   0.172   9.266 

group_by_simpleLM does all 4 responses almost instantly, while glm4 needs 9s for one response alone!

Note that glm4 can break down in the rank-deficient case, while group_by_simpleLM does not.


Appendix: "group_by_simpleLM.cpp"

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
List group_by_simpleLM_cpp (List Y, NumericVector x, IntegerVector group, CharacterVector group_levels, bool group_unsorted) {

  /* number of data and number of responses */
  int n = x.size(), k = Y.size(), n_groups = group_levels.size();

  /* set up result list */
  List result(k);
  List dimnames = List::create(CharacterVector::create("alpha", "beta", "beta.se", "r2", "df.resid"), group_levels);
  int j; for (j = 0; j < k; j++) {
    NumericMatrix mat(5, n_groups);
    mat.attr("dimnames") = dimnames;
    result[j] = mat;
    }
  result.attr("names") = Y.attr("names");

  /* set up a vector to hold sample size for each group */
  size_t *group_offset = (size_t *)calloc(n_groups + 1, sizeof(size_t));

  /*
    compute group offset: cumsum(group_offset)
    The offset is used in a different way when group is sorted or unsorted
    In the former case, it is the offset to real x, y values;
    In the latter case, it is the offset to ordering index indx
 */
  int *u = INTEGER(group), *u_end = u + n, i;
  if (n_groups > 1) {
    while (u < u_end) group_offset[*u++]++;
    for (i = 0; i < n_groups; i++) group_offset[i + 1] += group_offset[i];
    } else {
    group_offset[1] = n;
    group_unsorted = 0;
    }

  /* local variables & pointers */
  double *xi, *xi_end;    /* pointer to the 1st and the last x value */
  double *yi;             /* pointer to the first y value */
  int gi; double inv_gi;  /* sample size of the i-th group */
  double xi_mean, xi_var; /* mean & variance of x values in the i-th group */
  double yi_mean, yi_var; /* mean & variance of y values in the i-th group */
  double xiyi_cov;        /* covariance between x and y values in the i-th group */
  double beta, r2; int df_resi;
  double *matij;

  /* additional storage and variables when group is unsorted */
  int *indx; double *xb, *xbi, dtmp;
  if (group_unsorted) {
    indx = (int *)malloc(n * sizeof(int));
    xb = (double *)malloc(n * sizeof(double));  // buffer x for caching
    R_orderVector1(indx, n, group, TRUE, FALSE);  // Er, how is TRUE & FALSE recognized as Rboolean?
    }

  /* loop through groups */
  for (i = 0; i < n_groups; i++) {
    /* set group size gi */
    gi = group_offset[i + 1] - group_offset[i];
    /* special case for a factor level with no data */
    if (gi == 0) {
      for (j = 0; j < k; j++) {
        /* matrix column for write-back */
        matij = REAL(result[j]) + i * 5;
        matij[0] = R_NaN; matij[1] = R_NaN; matij[2] = R_NaN;
        matij[3] = R_NaN; matij[4] = R_NaN;
        }
      continue;
      }
    /* rank-deficient case */
    if (gi == 1) {
      gi = group_offset[i];
      if (group_unsorted) gi = indx[gi];
      for (j = 0; j < k; j++) {
        /* matrix column for write-back */
        matij = REAL(result[j]) + i * 5;
        matij[0] = REAL(Y[j])[gi];
        matij[1] = R_NaN; matij[2] = R_NaN;
        matij[3] = R_NaN; matij[4] = 0.0;
        }
      continue;
      }
    /* general case where a regression line can be estimated */
    inv_gi = 1 / (double)gi;
    /* compute mean & variance of x values in this group */
    xi_mean = 0.0; xi_var = 0.0;
    if (group_unsorted) {
      /* use u, u_end and xbi */
      xi = REAL(x);
      u = indx + group_offset[i];  /* offset acts on index */
      u_end = u + gi;
      xbi = xb + group_offset[i];
      for (; u < u_end; xbi++, u++) {
        dtmp = xi[*u];
        xi_mean += dtmp;
        xi_var += dtmp * dtmp;
        *xbi = dtmp;
        }
      } else {
      /* use xi and xi_end */
      xi = REAL(x) + group_offset[i];  /* offset acts on values */
      xi_end = xi + gi;
      for (; xi < xi_end; xi++) {
        xi_mean += *xi;
        xi_var += (*xi) * (*xi);
        }
      }
    xi_mean = xi_mean * inv_gi;
    xi_var = xi_var * inv_gi - xi_mean * xi_mean;
    /* loop through responses doing simple linear regression */
    for (j = 0; j < k; j++) {
      /* compute mean & variance of y values, as well its covariance with x values */
      yi_mean = 0.0; yi_var = 0.0; xiyi_cov = 0.0;
      if (group_unsorted) {
        xbi = xb + group_offset[i];  /* use buffered x values */
        yi = REAL(Y[j]);
        u = indx + group_offset[i];  /* offset acts on index */
        for (; u < u_end; u++, xbi++) {
          dtmp = yi[*u];
          yi_mean += dtmp;
          yi_var += dtmp * dtmp;
          xiyi_cov += dtmp * (*xbi);
          } 
        } else {
        /* set xi and yi */
        xi = REAL(x) + group_offset[i];  /* offset acts on values */
        yi = REAL(Y[j]) + group_offset[i];  /* offset acts on values */
        for (; xi < xi_end; xi++, yi++) {
          yi_mean += *yi;
          yi_var += (*yi) * (*yi);
          xiyi_cov += (*yi) * (*xi);
          }
        }
      yi_mean = yi_mean * inv_gi;
      yi_var = yi_var * inv_gi - yi_mean * yi_mean;
      xiyi_cov = xiyi_cov * inv_gi - xi_mean * yi_mean;
      /* locate the right place to write back regression result */
      matij = REAL(result[j]) + i * 5 + 4;
      /* residual degree of freedom */
      df_resi = gi - 2; *matij-- = (double)df_resi;
      /* R-squared = squared correlation */
      r2 = (xiyi_cov * xiyi_cov) / (xi_var * yi_var); *matij-- = r2;
      /* standard error of regression slope */
      if (df_resi == 0) *matij-- = 0.0;
      else *matij-- = sqrt((1 - r2) * yi_var / (df_resi * xi_var));
      /* regression slope */
      beta = xiyi_cov / xi_var; *matij-- = beta;
      /* regression intercept */
      *matij = yi_mean - beta * xi_mean;
      }
    }

  if (group_unsorted) {
    free(indx);
    free(xb);
    }
  free(group_offset);
  return result;
  }

/*** R
group_by_simpleLM <- function (dat, LHS, RHS, group = NULL) {

  ## basic input validation
  if (missing(dat)) stop("no data provided to 'dat'!")
  if (!is.data.frame(dat)) stop("'dat' must be a data frame!")

  if (missing(LHS)) stop("no 'LHS' provided!")
  if (!is.character(LHS)) stop("'LHS' must be provided as a character vector of variable names!")

  if (missing(RHS)) stop("no 'RHS' provided!")
  if (!is.character(RHS)) stop("'RHS' must be provided as a character vector of variable names!")

  if (!is.null(group)) {

    ## grouping variable provided: a fast method of `nlme::lmList`

    if (!is.character(group)) stop("'group' must be provided as a character vector of variable names!")

    ## ensure that group has length 1, is available in the data frame and is a factor
    if (length(group) > 1L) {
      warning("only one grouping variable allowed for group-by simple linear regression; ignoring all but the 1st variable provided!")
      group <- group[1L]
      }
    grp <- dat[[group]]
    if (is.null(grp)) stop(sprintf("grouping variable '%s' not found in 'dat'!", group))

    if (is.factor(grp)) {
      grp_levels <- levels(grp)
      } else {
      warning("grouping variable is not provided as a factor; fast coercion is made!")
      grp_levels <- unique(grp)
      grp <- match(grp, grp_levels)
      grp_levels <- as.character(grp_levels)
      }

    grp_unsorted <- .Internal(is.unsorted(grp, FALSE))

    } else {

    ## no grouping; a fast method of `lm`
    grp <- 1L; grp_levels <- "ALL"; grp_unsorted <- FALSE

    }

  ## the RHS must have length 1, be available in the data frame and be numeric
  if (length(RHS) > 1L) {
    warning("only one RHS variable allowed for simple linear regression; ignoring all but the 1st variable provided!")
    RHS <- RHS[1L]
    }
  x <- dat[[RHS]]
  if (is.null(x)) stop(sprintf("RHS variable '%s' not found in 'dat'!", RHS))
  if (!is.numeric(x) || is.factor(x)) {
    stop("RHS variable must be 'numeric' for simple linear regression!")
    }
  x <- as.numeric(x)  ## just in case that `x` is an integer

  ## check LHS variables
  nested <- match(RHS, LHS, nomatch = 0L)
  if (nested > 0L) {
    warning(sprintf("RHS variable '%s' found in LHS variables; removing it from LHS", RHS))
    LHS <- LHS[-nested]
    }
  if (length(LHS) == 0L) stop("no usable LHS variables found!")
  missed <- !(LHS %in% names(dat))
  if (any(missed)) {
    warning(sprintf("LHS variables '%s' not found in 'dat'; removing them from LHS", toString(LHS[missed])))
    LHS <- LHS[!missed]
    }
  if (length(LHS) == 0L) stop("no usable LHS variables found!")
  Y <- dat[LHS]
  invalid_LHS <- vapply(Y, is.factor, FALSE) | (!vapply(Y, is.numeric, FALSE))
  if (any(invalid_LHS)) {
    warning(sprintf("LHS variables '%s' are non-numeric or factors; removing them from LHS", toString(LHS[invalid_LHS])))
    Y <- Y[!invalid_LHS]
    }
  if (length(Y) == 0L) stop("no usable LHS variables found!")
  Y <- lapply(Y, as.numeric)  ## just in case that we have integer variables in Y

  ## check for existence of NA, NaN, Inf and -Inf and drop them?

  ## use Rcpp
  group_by_simpleLM_cpp(Y, x, grp, grp_levels, grp_unsorted)
  }
*/