
I have a large data set consisting of factor variables, numeric variables, and a target column, which I'm trying to feed into xgboost with the goal of building an xgb.DMatrix and training a model.

I'm confused about the proper processing to get my dataframe into an xgb.DMatrix object. Specifically, I have NAs in both factor and numeric variables, and I want to make a sparse.model.matrix from my dataframe before creating the xgb.DMatrix. The proper handling of the NAs is really tripping me up.

I have the following sample dataframe df consisting of one binary categorical variable, two continuous variables, and a target. The categorical variable and one of the continuous variables have NAs:

'data.frame':   10 obs. of  4 variables:
 $ v1    : Factor w/ 2 levels "0","1": 1 2 2 1 NA 2 1 1 NA 2
 $ v2    : num  3.2 5.4 8.3 NA 7.1 8.2 9.4 NA 9.9 4.2
 $ v3    : num  22.1 44.1 57 64.2 33.1 56.9 71.2 33.9 89.3 97.2
 $ target: Factor w/ 2 levels "0","1": 1 1 2 2 1 1 1 2 1 1

     v1  v2   v3 target
1     0 3.2 22.1      0
2     1 5.4 44.1      0
3     1 8.3 57.0      1
4     0  NA 64.2      1
5  <NA> 7.1 33.1      0
6     1 8.2 56.9      0
7     0 9.4 71.2      0
8     0  NA 33.9      1
9  <NA> 9.9 89.3      0
10    1 4.2 97.2      0

sparse.model.matrix from the Matrix package won't accept NAs; it eliminates those rows (which I don't want). So I'll need to change the NAs to a numeric placeholder like -999.
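To make the row-dropping concrete, here is a minimal sketch (rebuilding the toy df above) showing that, with R's default na.action (na.omit), sparse.model.matrix silently discards every row containing an NA:

```r
library(Matrix)

# Toy data frame from the question
df <- data.frame(
  v1     = factor(c(0, 1, 1, 0, NA, 1, 0, 0, NA, 1)),
  v2     = c(3.2, 5.4, 8.3, NA, 7.1, 8.2, 9.4, NA, 9.9, 4.2),
  v3     = c(22.1, 44.1, 57.0, 64.2, 33.1, 56.9, 71.2, 33.9, 89.3, 97.2),
  target = factor(c(0, 0, 1, 1, 0, 0, 0, 1, 0, 0))
)

# Default na.action is na.omit, so NA rows are dropped
sm <- sparse.model.matrix(target ~ ., data = df)
nrow(sm)  # 6, not 10: rows 4, 5, 8 and 9 (each containing an NA) are gone
```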

If I use the simple command:

df[is.na(df)] = -999

it only replaces the NAs in the numeric columns:

     v1     v2   v3 target
1     0    3.2 22.1      0
2     1    5.4 44.1      0
3     1    8.3 57.0      1
4     0 -999.0 64.2      1
5  <NA>    7.1 33.1      0
6     1    8.2 56.9      0
7     0    9.4 71.2      0
8     0 -999.0 33.9      1
9  <NA>    9.9 89.3      0
10    1    4.2 97.2      0

So I (think I) first need to convert the factor variables to numeric and then do the substitution. Doing that I get:

     v1     v2   v3 target
1     1    3.2 22.1      0
2     2    5.4 44.1      0
3     2    8.3 57.0      1
4     1 -999.0 64.2      1
5  -999    7.1 33.1      0
6     2    8.2 56.9      0
7     1    9.4 71.2      0
8     1 -999.0 33.9      1
9  -999    9.9 89.3      0
10    2    4.2 97.2      0
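The two steps just described (factor to numeric, then the NA substitution) can be sketched as follows; note that as.numeric on a factor returns the underlying integer codes, not the original labels:

```r
# Rebuild the relevant toy columns so the example is self-contained
df <- data.frame(
  v1 = factor(c(0, 1, 1, 0, NA, 1, 0, 0, NA, 1)),
  v2 = c(3.2, 5.4, 8.3, NA, 7.1, 8.2, 9.4, NA, 9.9, 4.2)
)

df$v1 <- as.numeric(df$v1)   # factor levels "0","1" become the codes 1, 2
df[is.na(df)] <- -999        # now the substitution reaches v1 as well
df$v1
# [1]    1    2    2    1 -999    2    1    1 -999    2
```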

But when I convert the variable back to a factor (I think this is necessary so xgboost will later know it's a factor), I get three levels:

'data.frame':   10 obs. of  4 variables:
 $ v1    : Factor w/ 3 levels "-999","1","2": 2 3 3 2 1 3 2 2 1 3
 $ v2    : num  3.2 5.4 8.3 -999 7.1 8.2 9.4 -999 9.9 4.2
 $ v3    : num  22.1 44.1 57 64.2 33.1 56.9 71.2 33.9 89.3 97.2
 $ target: Factor w/ 2 levels "0","1": 1 1 2 2 1 1 1 2 1 1

I'm now not sure that making the sparse.model.matrix, and ultimately the xgb.DMatrix object, will be meaningful, because v1 appears messed up.

To make matters more confusing, xgb.DMatrix() has a missing argument that I can use to identify numeric values (like -999) that represent NAs. But it can only be used with a dense matrix. If I submitted a dense matrix I'd still have the actual NAs and wouldn't need it; yet in the sparse matrix, where I have -999s, I can't use it.
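For reference, here is a sketch of the dense-matrix route where the missing argument does apply. data.matrix turns factors into their integer codes while keeping the NAs, which are then replaced and flagged:

```r
library(xgboost)

# Toy data frame from the question, rebuilt for a self-contained example
df <- data.frame(
  v1     = factor(c(0, 1, 1, 0, NA, 1, 0, 0, NA, 1)),
  v2     = c(3.2, 5.4, 8.3, NA, 7.1, 8.2, 9.4, NA, 9.9, 4.2),
  v3     = c(22.1, 44.1, 57.0, 64.2, 33.1, 56.9, 71.2, 33.9, 89.3, 97.2),
  target = factor(c(0, 0, 1, 1, 0, 0, 0, 1, 0, 0))
)

m <- data.matrix(df[, c("v1", "v2", "v3")])  # dense matrix, NAs preserved
m[is.na(m)] <- -999                          # numeric placeholder
dtrain <- xgb.DMatrix(data = m, label = as.numeric(df$target) - 1,
                      missing = -999)        # -999 entries treated as missing
```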

I hope I'm not overlooking something easy. I've been through xgboost.pdf extensively and searched on Google.

Please help. Thanks in advance.

Why do you want to make a sparse matrix if you have only 3 predictors? Secondly, xgboost only accepts numerical features. – mtoto

1. My real dataset has 185 predictors and 70,000 observations. 2. xgboost accepts categorical variables (I think) when changed to numeric form (0,1)s, like I've done. – Windstorm1981

You can just use options(na.action='na.pass'); sm <- sparse.model.matrix(target ~ ., df), then feed it into xgboost. The sparse matrix will only contain non-missing values either way for your dummy variables. For continuous vars, a missing (default) direction is learnt. – mtoto

@mtoto Thx. Why do I see people replacing NAs with -9999 all over the place and then making sparse matrices? – Windstorm1981

See here. If you think your NAs are not randomly distributed, you can convert them to a non-missing value and see if it has predictive power. – mtoto

1 Answer


options(na.action='na.pass'), as mentioned by @mtoto, is the best way to deal with this problem. It makes sure that you don't lose any data while building the model matrix.

Regarding the XGBoost implementation specifically: in the case of NAs, it checks which direction yields the higher gain when evaluating splits while growing a tree. For example, if a split on a variable var1 (range [0,1]) is chosen at value 0.5 without considering the NAs, XGBoost then computes the gain with the NAs assigned to the < 0.5 side and with them assigned to the > 0.5 side. Whichever direction produces more gain is the direction the NAs get. So the NAs now have a range [0, 0.5] or [0.5, 1], but no actual value is imputed for them. Refer to the original author tqchen's comment on Aug 12, 2014.

If you impute -99xxx there, you limit the algorithm's ability to learn the NAs' proper range (conditional on the labels).
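Putting the answer together, a minimal end-to-end sketch (rebuilding the question's toy df; nrounds and the objective are illustrative choices, not from the original post): na.pass keeps all rows in the sparse model matrix, and xgboost then learns a default direction for the missing entries itself.

```r
library(Matrix)
library(xgboost)

# Toy data frame from the question
df <- data.frame(
  v1     = factor(c(0, 1, 1, 0, NA, 1, 0, 0, NA, 1)),
  v2     = c(3.2, 5.4, 8.3, NA, 7.1, 8.2, 9.4, NA, 9.9, 4.2),
  v3     = c(22.1, 44.1, 57.0, 64.2, 33.1, 56.9, 71.2, 33.9, 89.3, 97.2),
  target = factor(c(0, 0, 1, 1, 0, 0, 0, 1, 0, 0))
)

options(na.action = "na.pass")               # keep NA rows in the model matrix
sm <- sparse.model.matrix(target ~ ., data = df)
nrow(sm)  # 10: no rows dropped this time

dtrain <- xgb.DMatrix(data = sm, label = as.numeric(df$target) - 1)
bst <- xgboost(data = dtrain, objective = "binary:logistic",
               nrounds = 5, verbose = 0)
```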