Summary
I am using Python v3.7 and xgboost v0.81. I have continuous data (y) at the US state level, by week, from 2015 to 2019. I'm trying to regress y on the following features: year, month, week, region (encoded). The train set is August 2018 and before; the test set is September 2018 and onward. When I train the model this way, two weird things happen:
- feature_importances are all nan
- predictions are all the same constant value (0.5, 0.5, ...)
What I've tried
Fixing any one of the features to a single value allows the model to train appropriately, and the two issues above disappear, e.g. year == 2017 or region == 28.
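For concreteness, the subset experiment that works can be sketched as below. The DataFrame here is a tiny illustrative stand-in for my real data (same columns, made-up values); whether year stays in X after fixing it makes no difference, so I drop it:

```python
import pandas as pd

# Illustrative stand-in for my real df (same columns, made-up values)
df = pd.DataFrame({
    'year': [2017, 2017, 2018, 2018],
    'month': [1, 2, 1, 2],
    'week': [1, 5, 1, 5],
    'region_encoded': [0, 1, 0, 1],
    'target': [10.0, 20.0, 30.0, 40.0],
})

# Restricting any one feature to a single value, e.g. year == 2017,
# gives a subset on which the model trains normally
sub = df[df['year'] == 2017]
X_sub = sub[['month', 'week', 'region_encoded']]
y_sub = sub.target
# XGBRegressor(...).fit(X_sub.values, y_sub.values) then behaves as expected
```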
Code
(I know this is a temporal problem, but the general random-split case below exhibits the problem as well.)
import pandas as pd
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

X = df[['year', 'month', 'week', 'region_encoded']]
display(X)
y = df.target
display(y)

# Random split (the temporal split described above shows the same behavior)
X_train, X_test, y_train, y_test = train_test_split(X.values, y.values, test_size=0.1)

model = XGBRegressor(n_jobs=-1, n_estimators=1000).fit(X_train, y_train)
display(model.predict(X_test)[:20])
display(model.feature_importances_)
Results - the head of X and y, some of the predictions, and the feature importances
X (first rows):
   year  month  week  region_encoded
0  2015     10    40               0
1  2015     10    40               1
2  2015     10    40               2
3  2015     10    40               3
4  2015     10    40               4

y (first rows):
0    272.0
1     10.0
2    290.0
3     46.0
4    558.0
Name: target, dtype: float64

model.predict(X_test)[:20]:
array([0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5], dtype=float32)

model.feature_importances_:
array([nan, nan, nan, nan], dtype=float32)
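One thing I noticed while debugging: 0.5 happens to be XGBoost's default base_score, so predictions stuck at exactly 0.5 would be consistent with the trees never making a useful split. As a sanity check (a sketch only; the DataFrame here is an illustrative stand-in for my real df), I looked for NaN or infinite values in the features and target:

```python
import numpy as np
import pandas as pd

# Illustrative stand-in with the same columns as my real data
df = pd.DataFrame({
    'year': [2015, 2015],
    'month': [10, 10],
    'week': [40, 40],
    'region_encoded': [0, 1],
    'target': [272.0, 10.0],
})

cols = ['year', 'month', 'week', 'region_encoded', 'target']
# Any NaN or +/-inf here could stop the booster from learning anything
nan_counts = df[cols].isna().sum()
inf_count = np.isinf(df[cols].to_numpy()).sum()
print(nan_counts)
print(inf_count)
```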
Comment: Could you show what df['target'] and df[[...]] (for y and X respectively) look like, please? Also, could you pass df['target'].values and df[[...]].values to XGBoost, just to be safe? – SARose