2
votes

I am new to machine learning and xgboost, and I am solving a regression problem. My target values are very small (e.g. -1.23e-12).

I am using linear regression and an XGBoost regressor, but XGBoost always predicts the same value, like:

[1.32620335e-05 1.32620335e-05 ... 1.32620335e-05].

I tried to tune some parameters of XGBRegressor, but it still predicted the same values.

I've seen Scaling of target causes Scikit-learn SVM regression to break down, so I tried scaling my target values (data.target = data.target * (10**12)), and that fixed the problem. But I am not sure it is reasonable to scale the target like this, and I don't know whether this problem in xgboost has the same cause as in SVR.

Here is a summary of the target values in my data:


    count    2.800010e+05
    mean    -1.722068e-12
    std      6.219815e-13
    min     -4.970697e-12
    25%     -1.965893e-12
    50%     -1.490800e-12
    75%     -1.269998e-12
    max     -1.111604e-12

And part of my code:



    # Assumed imports for the snippet below (df, feature and target are defined earlier)
    import xgboost
    from sklearn import linear_model
    from sklearn.model_selection import train_test_split

    X = df[feature].values
    y = df[target].values * (10 ** 12)   # scale the target up from ~1e-12 to ~1e0
    X_train, X_test, y_train, y_test = train_test_split(X, y)
    xgb = xgboost.XGBRegressor()
    LR = linear_model.LinearRegression()
    xgb.fit(X_train, y_train)
    LR.fit(X_train, y_train)
    xgb_predicted = xgb.predict(X_test)
    LR_predicted = LR.predict(X_test)
    print('xgb predicted:', xgb_predicted[0:5])
    print('LR predicted:', LR_predicted[0:5])
    print('ground truth:', y_test[0:5])


Output:


    xgb predicted: [-1.5407631 -1.49756   -1.9647646 -2.7702322 -2.5296502]
    LR predicted: [-1.60908805 -1.51145989 -1.71565321 -2.25043287 -1.65725868]
    ground truth: [-1.6572993  -1.59879922 -2.39709641 -2.26119817 -2.01300088]

And the output with y = df[target].values (i.e., without scaling the target):


    xgb predicted: [1.32620335e-05 1.32620335e-05 1.32620335e-05 1.32620335e-05
     1.32620335e-05]
    LR predicted: [-1.60908805e-12 -1.51145989e-12 -1.71565321e-12 -2.25043287e-12
     -1.65725868e-12]
    ground truth: [-1.65729930e-12 -1.59879922e-12 -2.39709641e-12 -2.26119817e-12
     -2.01300088e-12]

1
You are probably hitting precision issues (since the values are so small). Scaling is okay for linear regression; you are just solving 1e12*(ax+b) instead of ax+b. Is that your question? – dgumo
Hi dgumo, yes, but even when I didn't scale the target by 1e12, linear regression could still predict reasonable values while xgboost could not. – chuzz
Can you share some details of your data and code? – dgumo
OK, I've just updated the question. – chuzz

1 Answer

2
votes

Let's try something simpler. I suspect that if you fit a DecisionTreeRegressor (sklearn) to your problem (without scaling), you will likely see similar behavior.
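For instance, a minimal sketch along these lines (with synthetic data standing in for yours; the single-leaf outcome assumes a sklearn version whose split threshold, min_impurity_split, is around 1e-7):

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.RandomState(0)
    X = rng.rand(1000, 5)                      # stand-in features
    y = -1.7e-12 + 6e-13 * rng.randn(1000)     # targets on the question's tiny scale

    tree = DecisionTreeRegressor().fit(X, y)
    # With a ~1e-7 impurity threshold, the target variance (~4e-25) is far too
    # small to justify any split, so the tree should stay a single leaf.
    print(tree.tree_.node_count)        # expected: 1
    print(np.unique(tree.predict(X)))   # expected: one constant prediction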

Also, most likely, the nodes in your (xgboost) trees are not getting split at all; you can check by inspecting xgb.get_booster().get_dump().
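For example (with xgb being the fitted XGBRegressor from the question):

    # Each entry of the dump is the text of one boosted tree; a tree that
    # never split is dumped as a single "0:leaf=..." line.
    dump = xgb.get_booster().get_dump()
    print(len(dump))   # number of boosted trees
    print(dump[0])     # first tree; only a leaf line here means no splits happened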

Now, try this: run multiple experiments, scaling your y so that it is of order 1e-1, then of order 1e-2, and so on. You will see that the decision tree stops splitting below some order of magnitude. I believe this is linked to the minimum impurity threshold; for example, the sklearn decision tree default is defined here: https://github.com/scikit-learn/scikit-learn/blob/ed5e127b/sklearn/tree/tree.py#L285 (around 1e-7).
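A rough sketch of that experiment (again with synthetic stand-in data; the exact order at which splitting stops depends on the impurity threshold of your sklearn version):

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.RandomState(0)
    X = rng.rand(1000, 5)
    y_base = X[:, 0] + 0.1 * rng.randn(1000)   # informative target of order 1e0

    for k in range(13):
        y = y_base * 10.0 ** (-k)              # shrink y by one order of magnitude each step
        tree = DecisionTreeRegressor().fit(X, y)
        # Once the target variance drops below the ~1e-7 impurity threshold,
        # node_count should collapse to 1 (a single leaf).
        print(f'scale 1e-{k:02d}: nodes = {tree.tree_.node_count}')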

This is my best guess at the moment. If someone can add to or verify this then I'll be happy to learn :)