The regsubsets function from the leaps package treats every level of a categorical (factor) variable as an independent dummy variable. I would like to change this behavior.

Example using the iris dataset, where Species is a factor variable:

library(leaps)
data(iris)

models <- regsubsets( Sepal.Length~., data = iris, nvmax = 4)
summary(models)

Subset selection object
Call: regsubsets.formula(Sepal.Length ~ ., data = iris, nvmax = 4)
5 Variables  (and intercept)
                  Forced in Forced out
Sepal.Width           FALSE      FALSE
Petal.Length          FALSE      FALSE
Petal.Width           FALSE      FALSE
Speciesversicolor     FALSE      FALSE
Speciesvirginica      FALSE      FALSE
1 subsets of each size up to 4
Selection Algorithm: exhaustive
         Sepal.Width Petal.Length Petal.Width Speciesversicolor Speciesvirginica
1  ( 1 ) " "         "*"          " "         " "               " "             
2  ( 1 ) "*"         "*"          " "         " "               " "             
3  ( 1 ) "*"         "*"          "*"         " "               " "             
4  ( 1 ) "*"         "*"          " "         "*"               "*"   

Notice that regsubsets created the dummy variables Speciesversicolor and Speciesvirginica, which now take up two of the four 'spaces' for variables in the fourth row. I would like Species to take up just one space.
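For illustration, the expansion happens in the design matrix itself: with R's default treatment contrasts, a k-level factor becomes k-1 dummy columns, and that expanded matrix is what regsubsets searches over. A quick way to see this:

```r
# With default treatment contrasts, the 3-level factor Species expands
# into two dummy columns in the design matrix.
data(iris)
colnames(model.matrix(~ Species, data = iris))
#> [1] "(Intercept)"       "Speciesversicolor" "Speciesvirginica"
```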

Is it possible to change this behavior of the regsubsets function?

A similar question has been asked before, but most of the comments (and I) agree that the question remains unanswered: https://stats.stackexchange.com/questions/152158/r-model-selection-with-categorical-variables-using-leaps-and-glmnet

Here is another similar and unanswered question: R: can I get regsubsets() to in-/exclude variables by groups?
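(For completeness: since leaps itself offers no grouping option, one workaround I have considered is a small hand-rolled exhaustive search over whole predictors, so a factor counts as a single term. This is only a sketch, not leaps, and uses adjusted R-squared as an arbitrary comparison criterion.)

```r
# Exhaustive search over whole predictors: Species occupies one "space".
data(iris)
preds <- setdiff(names(iris), "Sepal.Length")

best_of_size <- sapply(seq_along(preds), function(k) {
  combos <- combn(preds, k, simplify = FALSE)
  adj_r2 <- vapply(combos, function(vars) {
    f <- reformulate(vars, response = "Sepal.Length")
    summary(lm(f, data = iris))$adj.r.squared
  }, numeric(1))
  # Keep the best predictor combination of this size
  paste(combos[[which.max(adj_r2)]], collapse = " + ")
})
best_of_size
```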

2 Answers

As G. Grothendieck stated, 'leaps only works with quantitative variables'.

There are, however, other packages that can perform feature selection using best-subsets regression and that can handle categorical variables with multiple levels.

With the olsrr package and the same example data, it can be done in one line of code.

library(olsrr)
data("iris")

model <- lm(Sepal.Length ~ ., data = iris)

ols_step_best_subset(model)
#>                   Best Subsets Regression                  
#> -----------------------------------------------------------
#> Model Index    Predictors
#> -----------------------------------------------------------
#>      1         Petal.Length                                 
#>      2         Sepal.Width Petal.Length                     
#>      3         Sepal.Width Petal.Length Species             
#>      4         Sepal.Width Petal.Length Petal.Width Species 
#> -----------------------------------------------------------
#> 
#>                                                     Subsets Regression Summary                                                     
#> -----------------------------------------------------------------------------------------------------------------------------------
#>                        Adj.        Pred                                                                                             
#> Model    R-Square    R-Square    R-Square      C(p)        AIC         SBIC         SBC        MSEP       FPE       HSP       APC  
#> -----------------------------------------------------------------------------------------------------------------------------------
#>   1        0.7600      0.7583      0.7532    114.5104    160.0404    -267.6979    169.0723    24.8565    0.1679    0.0011    0.2465 
#>   2        0.8402      0.8380       0.834     29.4478    101.0255    -325.5037    113.0680    16.6628    0.1133     8e-04    0.1663 
#>   3        0.8633      0.8595      0.8536      6.3448     81.5749    -346.0176     99.6387    14.3495    0.0989     7e-04    0.1442 
#>   4        0.8673      0.8627      0.8554      4.0000     79.1160    -348.1523    100.1905    14.0259    0.0973     7e-04    0.1418 
#> -----------------------------------------------------------------------------------------------------------------------------------
#> AIC: Akaike Information Criteria 
#>  SBIC: Sawa's Bayesian Information Criteria 
#>  SBC: Schwarz Bayesian Criteria 
#>  MSEP: Estimated error of prediction, assuming multivariate normality 
#>  FPE: Final Prediction Error 
#>  HSP: Hocking's Sp 
#>  APC: Amemiya Prediction Criteria

We can also run all-possible-subsets feature selection:

ols_step_all_possible(model)
#>    Index N                                   Predictors   R-Square
#> 2      1 1                                 Petal.Length 0.75995465
#> 3      2 1                                  Petal.Width 0.66902769
#> 4      3 1                                      Species 0.61870573
#> 1      4 1                                  Sepal.Width 0.01382265
#> 5      5 2                     Sepal.Width Petal.Length 0.84017784
#> 9      6 2                         Petal.Length Species 0.83672378
#> 8      7 2                     Petal.Length Petal.Width 0.76626130
#> 7      8 2                          Sepal.Width Species 0.72590661
#> 6      9 2                      Sepal.Width Petal.Width 0.70723708
#> 10    10 2                          Petal.Width Species 0.66936637
#> 12    11 3             Sepal.Width Petal.Length Species 0.86330878
#> 11    12 3         Sepal.Width Petal.Length Petal.Width 0.85861172
#> 14    13 3             Petal.Length Petal.Width Species 0.83672544
#> 13    14 3              Sepal.Width Petal.Width Species 0.73238452
#> 15    15 4 Sepal.Width Petal.Length Petal.Width Species 0.86731226
#>    Adj. R-Square Mallow's Cp
#> 2    0.758332718  114.510364
#> 3    0.666791387  213.189280
#> 4    0.613518054  267.801422
#> 1    0.007159294  924.253661
#> 5    0.838003384   29.447765
#> 9    0.833368794   33.196290
#> 8    0.763081179  109.666040
#> 7    0.720274553  153.461158
#> 6    0.703253908  173.722356
#> 10   0.662572526  214.821724
#> 12   0.859537986    6.344799
#> 11   0.855706481   11.442304
#> 14   0.832221316   35.194491
#> 13   0.725002020  148.430978
#> 15   0.862705049    4.000000

Created on 2020-10-07 by the reprex package (v0.3.0)

leaps only works with quantitative variables. This is specifically mentioned in Modern Applied Statistics with S by Venables and Ripley -- look up leaps in the index.

For stepwise selection based on dropping and adding terms, which will include or exclude all levels of any factor together, use step or, in MASS, stepAIC.

fm <- lm(Sepal.Length ~., iris)
step(fm)
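As a sketch of the MASS alternative: stepAIC works the same way at the level of whole model terms, so all dummy columns for Species enter or leave together. Here direction = "both" allows both dropping and adding terms, and trace = FALSE just suppresses the per-step printout.

```r
library(MASS)

fm <- lm(Sepal.Length ~ ., data = iris)

# stepAIC selects on whole terms, so Species is treated as one unit.
fit <- stepAIC(fm, direction = "both", trace = FALSE)
formula(fit)
```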