I have a large data set consisting of factor variables, numeric variables, and a target column that I'm trying to feed into xgboost properly, with the goal of building an xgb.DMatrix and training a model.
I'm confused about the proper processing to get my dataframe into an xgb.DMatrix object. Specifically, I have NAs in both factor and numeric variables, and I want to make a sparse.model.matrix from my dataframe before creating the xgb.DMatrix. The proper handling of the NAs is really messing me up.
I have the following sample dataframe df, consisting of one binary categorical variable, two continuous variables, and a target. The categorical variable and one continuous variable have NAs:
'data.frame': 10 obs. of 4 variables:
$ v1 : Factor w/ 2 levels "0","1": 1 2 2 1 NA 2 1 1 NA 2
$ v2 : num 3.2 5.4 8.3 NA 7.1 8.2 9.4 NA 9.9 4.2
$ v3 : num 22.1 44.1 57 64.2 33.1 56.9 71.2 33.9 89.3 97.2
$ target: Factor w/ 2 levels "0","1": 1 1 2 2 1 1 1 2 1 1
v1 v2 v3 target
1 0 3.2 22.1 0
2 1 5.4 44.1 0
3 1 8.3 57.0 1
4 0 NA 64.2 1
5 <NA> 7.1 33.1 0
6 1 8.2 56.9 0
7 0 9.4 71.2 0
8 0 NA 33.9 1
9 <NA> 9.9 89.3 0
10 1 4.2 97.2 0
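For anyone who wants to reproduce this, a minimal sketch that rebuilds the sample dataframe from the printout above (the values are copied from the table; nothing else is assumed):

```r
# Rebuild the 10-row sample data frame shown above
df <- data.frame(
  v1     = factor(c(0, 1, 1, 0, NA, 1, 0, 0, NA, 1)),
  v2     = c(3.2, 5.4, 8.3, NA, 7.1, 8.2, 9.4, NA, 9.9, 4.2),
  v3     = c(22.1, 44.1, 57.0, 64.2, 33.1, 56.9, 71.2, 33.9, 89.3, 97.2),
  target = factor(c(0, 0, 1, 1, 0, 0, 0, 1, 0, 0))
)

str(df)  # v1 and target are factors, v2 and v3 are numeric
```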
sparse.model.matrix from the Matrix library won't accept NAs. It eliminates the rows (which I don't want). So I'll need to change the NAs to a numeric replacement like -999.
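The row dropping can be seen directly. This is a small sketch, assuming the Matrix package is installed and that the global `na.action` option is at R's default of `na.omit` (set explicitly here so the result does not depend on the session):

```r
library(Matrix)

df <- data.frame(
  v1     = factor(c(0, 1, 1, 0, NA, 1, 0, 0, NA, 1)),
  v2     = c(3.2, 5.4, 8.3, NA, 7.1, 8.2, 9.4, NA, 9.9, 4.2),
  v3     = c(22.1, 44.1, 57.0, 64.2, 33.1, 56.9, 71.2, 33.9, 89.3, 97.2),
  target = factor(c(0, 0, 1, 1, 0, 0, 0, 1, 0, 0))
)

# Make the default NA handling explicit, then restore the previous setting
old <- options(na.action = "na.omit")
sm_default <- sparse.model.matrix(target ~ ., data = df)
options(old)

nrow(df)          # all rows
nrow(sm_default)  # fewer rows: every row containing an NA was dropped
```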
If I use the simple command:
df[is.na(df)] = -999
it only replaces the NAs in the numeric columns:
v1 v2 v3 target
1 0 3.2 22.1 0
2 1 5.4 44.1 0
3 1 8.3 57.0 1
4 0 -999.0 64.2 1
5 <NA> 7.1 33.1 0
6 1 8.2 56.9 0
7 0 9.4 71.2 0
8 0 -999.0 33.9 1
9 <NA> 9.9 89.3 0
10 1 4.2 97.2 0
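The factor column is left untouched because -999 is not one of its levels: assigning a value outside the level set produces NA again (with a warning). One possible workaround, sketched here in base R, is to add an explicit level first. The level name "missing" is an arbitrary placeholder chosen for illustration.

```r
v1 <- factor(c("0", "1", "1", "0", NA, "1", "0", "0", NA, "1"))

# Assigning a value that is not an existing level just yields NA again
v1_try <- v1
suppressWarnings(v1_try[is.na(v1_try)] <- "-999")
sum(is.na(v1_try))  # the NAs are still there

# Add a dedicated level, then assign it to the NA positions
v1_fixed <- factor(v1, levels = c(levels(v1), "missing"))
v1_fixed[is.na(v1_fixed)] <- "missing"
table(v1_fixed)
```

Whether a separate "missing" level is appropriate depends on the data; it treats missingness as its own category.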
So I (think I) first need to change the factor variables to numeric and then do the substitution. Doing that, I get:
v1 v2 v3 target
1 1 3.2 22.1 0
2 2 5.4 44.1 0
3 2 8.3 57.0 1
4 1 -999.0 64.2 1
5 -999 7.1 33.1 0
6 2 8.2 56.9 0
7 1 9.4 71.2 0
8 1 -999.0 33.9 1
9 -999 9.9 89.3 0
10 2 4.2 97.2 0
but converting the factor variable back to a factor (I think this is necessary so xgboost will later know it's a factor), I get three levels:
'data.frame': 10 obs. of 4 variables:
$ v1 : Factor w/ 3 levels "-999","1","2": 2 3 3 2 1 3 2 2 1 3
$ v2 : num 3.2 5.4 8.3 -999 7.1 8.2 9.4 -999 9.9 4.2
$ v3 : num 22.1 44.1 57 64.2 33.1 56.9 71.2 33.9 89.3 97.2
$ target: Factor w/ 2 levels "0","1": 1 1 2 2 1 1 1 2 1 1
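The levels "1" and "2" (rather than "0" and "1") come from how R converts factors: `as.numeric()` on a factor returns the internal integer codes, not the labels. A short sketch of the difference, using the v1 values from the sample data:

```r
v1 <- factor(c("0", "1", "1", "0", NA, "1", "0", "0", NA, "1"))

# Internal level codes: "0" becomes 1 and "1" becomes 2
codes <- as.numeric(v1)

# Going through character first recovers the original labels as numbers
labels_num <- as.numeric(as.character(v1))

print(codes)
print(labels_num)
```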
I'm no longer sure that making the sparse.model.matrix and ultimately the xgb.DMatrix object will be meaningful, because v1 appears messed up.
To make matters more confusing, xgb.DMatrix() has an argument, missing, that I can use to identify numeric values (-999) that represent NAs. But this can only be used for a dense matrix. If I submitted the dense matrix I'd just have the NAs and wouldn't need that. However, in the sparse matrix where I have -999s, I can't use it.
I hope I'm not overlooking something easy. Been through xgboost.pdf extensively and looked on Google.
Please help. Thanks in advance.
xgboost only accepts numerical features. – mtoto
options(na.action='na.pass'); sm<-sparse.model.matrix(target~.,df), then feed it into xgboost. The sparse matrix will only contain non-missing values either way for your dummy variables. For continuous vars, a missing (default) direction is learnt. – mtoto
If the NA's are not randomly distributed, you can convert them to a non-missing value and see if it has predictive power. – mtoto
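The suggestion in the comments can be sketched end to end as follows. This is a rough outline under stated assumptions, not a verified recipe: it assumes the Matrix package is installed, only builds the xgb.DMatrix if xgboost happens to be available, and turns the factor target into a 0/1 numeric label. How stored NA entries in the sparse matrix are treated may depend on the xgboost version, so it is worth checking the documentation for the installed release.

```r
library(Matrix)

df <- data.frame(
  v1     = factor(c(0, 1, 1, 0, NA, 1, 0, 0, NA, 1)),
  v2     = c(3.2, 5.4, 8.3, NA, 7.1, 8.2, 9.4, NA, 9.9, 4.2),
  v3     = c(22.1, 44.1, 57.0, 64.2, 33.1, 56.9, 71.2, 33.9, 89.3, 97.2),
  target = factor(c(0, 0, 1, 1, 0, 0, 0, 1, 0, 0))
)

# Keep rows with NAs instead of dropping them, then restore the old setting
old <- options(na.action = "na.pass")
sm <- sparse.model.matrix(target ~ ., data = df)
options(old)

# Numeric 0/1 label taken from the factor labels (not the internal codes)
label <- as.numeric(as.character(df$target))

dim(sm)  # one row per observation is retained

# Build the DMatrix only if xgboost is installed (assumption: it may not be)
if (requireNamespace("xgboost", quietly = TRUE)) {
  dtrain <- xgboost::xgb.DMatrix(data = sm, label = label)
}
```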