1
votes

I can generate predictions for a fitted linear model by assigning the output of lm() to a name, like fit_lm, and then use predict() with that fit to generate predictions on newdata (see reprex below).

With big regressions, lm() objects can become large, as they carry with them the original data with which they were fit, as well as some other potentially large pieces of data. When I am doing this in an automated fashion over many datasets, the individual lm objects can take up a lot of space, and I don't want to carry around the whole lm object. I would like to extract a predictive function from the fit that I can store and use for prediction. Is there an easy way to just extract/construct from the fit a function that does prediction? At the very bottom of my reprex in comments is an example of how I envision the code working.

# Do a lm fit
set.seed(1234)
df <- data.frame(x = 1:9, y = 2 * 1:9 + 3 + rnorm(9, sd = 0.5))
fit <- lm(y ~ x, df)
summary(fit)
#> 
#> Call:
#> lm(formula = y ~ x, data = df)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -1.0125 -0.1178 -0.1007  0.3780  0.6995 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)   2.8519     0.4035   7.068 0.000199 ***
#> x             1.9969     0.0717  27.851 1.98e-08 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.5554 on 7 degrees of freedom
#> Multiple R-squared:  0.9911, Adjusted R-squared:  0.9898 
#> F-statistic: 775.7 on 1 and 7 DF,  p-value: 1.976e-08

# Predict it
predict(fit, data.frame(x = 5:6))
#>        1        2 
#> 12.83658 14.83351

# Like to see that I could extract the fit as a function that could be used:
#
# f <- regressionFunction(fit)
# vector_of_fits <- f(data.frame(x = 5:6))
#
# vector_of_fits would equal: 
#>        1        2 
#> 12.83658 14.83351

Created on 2020-01-07 by the reprex package (v0.3.0)

2

2 Answers

6
votes

First, we borrow a function from this other question that reduces the size of the lm object.

clean_model = function(cm) {
  # just in case we forgot to set
  # y=FALSE and model=FALSE
  cm$y = c()
  cm$model = c()

  cm$residuals = c()
  cm$fitted.values = c()
  cm$effects = c()
  cm$qr$qr = c()
  cm$linear.predictors = c()
  cm$weights = c()
  cm$prior.weights = c()
  cm$data = c()

  # also try and avoid some large environments
  attr(cm$terms,".Environment") = c()
  attr(cm$formula,".Environment") = c()

  cm
}

Then write a simple wrapper that reduces the model and returns the prediction function:

prediction_function <- function(model) {
  stopifnot(inherits(model, 'lm'))
  model <- clean_model(model)
  function (...) predict(model, ...)
}

Example:

set.seed(1234)
df <- data.frame(x = 1:9, y = 2 * 1:9 + 3 + rnorm(9, sd = 0.5))
fit <- lm(y ~ x, df)
f <- prediction_function(fit)
f(data.frame(x = 5:6))
       1        2 
12.83658 14.83351 

Check sizes:

object.size(fit)
# 16648 bytes

object.size(prediction_function)
# 8608 bytes

For this small example we save half the space.

Let's use some larger data:

data(diamonds, package = 'ggplot2')

fit2 <- lm(carat ~ price, diamonds)
predict(fit2, data.frame(price = 200))
f2 <- prediction_function(fit2)
f2(data.frame(price = 200))

print(object.size(fit2), units = 'Mb'); 
object.size(f2)

Now we go from 13 Mb to 5376 bytes.

1
votes

this is an answer using the helpful broom package to tidy model output.

library(broom)
set.seed(1234)
df <- data.frame(x = 1:9, y = 2 * 1:9 + 3 + rnorm(9, sd = 0.5))
fit <- lm(y ~ x, df)
summary(fit)
#> 
#> Call:
#> lm(formula = y ~ x, data = df)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -1.0125 -0.1178 -0.1007  0.3780  0.6995 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)   2.8519     0.4035   7.068 0.000199 ***
#> x             1.9969     0.0717  27.851 1.98e-08 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.5554 on 7 degrees of freedom
#> Multiple R-squared:  0.9911, Adjusted R-squared:  0.9898 
#> F-statistic: 775.7 on 1 and 7 DF,  p-value: 1.976e-08
predict(fit, data.frame(x = 5:6))
#>        1        2 
#> 12.83658 14.83351

# store model coef in data frame using broom
model_params <- tidy(fit)
model_params
#> # A tibble: 2 x 5
#>   term        estimate std.error statistic      p.value
#>   <chr>          <dbl>     <dbl>     <dbl>        <dbl>
#> 1 (Intercept)     2.85    0.403       7.07 0.000199    
#> 2 x               2.00    0.0717     27.9  0.0000000198

# create function to predict from model params
predict_from_params <- function(x, model_params){
  model_params[1,]$estimate + x * model_params[2,]$estimate
  }

predict_from_params(df$x, model_params)
#> [1]  4.848859  6.845790  8.842720 10.839651 12.836581 14.833512 16.830442
#> [8] 18.827373 20.824303