I'm trying to implement a boosted Poisson regression model in xgboost, but I'm finding that the results are biased at low frequencies. To illustrate, here is some minimal Python code that I think replicates the issue:
import numpy as np
import pandas as pd
import xgboost as xgb
def get_preds(mult):
    # generate toy dataset for illustration
    # 4 observations with linearly increasing frequencies
    # the frequencies are scaled by `mult`
    dmat = xgb.DMatrix(data=np.array([[0, 0], [0, 1], [1, 0], [1, 1]]),
                       label=[i * mult for i in [1, 2, 3, 4]],
                       weight=[1000, 1000, 1000, 1000])
    # train a Poisson booster on the toy data
    bst = xgb.train(
        params={"objective": "count:poisson"},
        dtrain=dmat,
        num_boost_round=100000,
        early_stopping_rounds=5,
        evals=[(dmat, "train")],
        verbose_eval=False)
    # return fitted frequencies after reversing the scaling
    return bst.predict(dmat) / mult
# test multipliers in the range [10**(-8), 10**0]
# display fitted frequencies
mults = [10**i for i in range(-8, 1)]
df = pd.DataFrame(np.round(np.vstack([get_preds(m) for m in mults]), 0))
df.index = mults
df.columns = ["(0, 0)", "(0, 1)", "(1, 0)", "(1, 1)"]
df
# --- result ---
#                (0, 0)   (0, 1)   (1, 0)   (1, 1)
# 1.000000e-08  11598.0  11598.0  11598.0  11598.0
# 1.000000e-07   1161.0   1161.0   1161.0   1161.0
# 1.000000e-06    118.0    118.0    118.0    118.0
# 1.000000e-05     12.0     12.0     12.0     12.0
# 1.000000e-04      2.0      2.0      3.0      3.0
# 1.000000e-03      1.0      2.0      3.0      4.0
# 1.000000e-02      1.0      2.0      3.0      4.0
# 1.000000e-01      1.0      2.0      3.0      4.0
# 1.000000e+00      1.0      2.0      3.0      4.0
Notice that at low frequencies the predictions seem to blow up. This may have something to do with the product of the Poisson lambda and the observation weight dropping below 1 (and in fact, increasing the weight above 1000 does shift the "blow-up" to lower frequencies), but even so I would expect the predictions to approach the mean training frequency (2.5). Also (not shown in the example above), reducing eta seems to increase the amount of bias in the predictions.
What would cause this to happen? Is a parameter available that would mitigate the effect?