I am working on a problem where I need to predict an output label from a set of input values. Since I do not have the real data yet, I am creating dummy data so that my code is ready by the time the data arrives. Below is what the sample data looks like. There are 32 input columns, and the last column, 'output', is the label to be predicted.

input_1,input_2,input_3,input_4,input_5,input_6,input_7,input_8,input_9,input_10,input_11,input_12,input_13,input_14,input_15,input_16,input_17,input_18,input_19,input_20,input_21,input_22,input_23,input_24,input_25,input_26,input_27,input_28,input_29,input_30,input_31,input_32,output
0.0,97.0,155,143,98,145,102,102,144,100,96,193,90,98,98,122,101,101,101,98,99,96,118,148,98,99,112,94,98,100,96.0,95,loc12
96.0,94.0,116,99,98,105,95,101,168,101,96,108,95,98,98,96,102,98,98,99,98,98,132,150,102,101,195,104,96,97,93.0,98,loc27

Since this is dummy data, I am setting the output label to the position of the input that has the maximum value. For example, in the first row the maximum value is at the 12th position, so the output is set to loc12. My expectation is that the XGBoost algorithm should learn this rule on its own and predict the output label correctly.
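For readers who want to reproduce this setup, the labeling rule described above can be sketched as follows. This is a minimal sketch, not the generator used for the question: the function name make_dummy_data, the value range 90 to 199, and the uniform random draws are all assumptions made for illustration. Note also that it can produce up to 32 distinct labels, whereas the data in the question appears to contain fewer classes.

```python
import numpy as np
import pandas as pd


def make_dummy_data(n_rows=1000, n_inputs=32, seed=0):
    """Hypothetical generator: label each row with the position of its largest input."""
    rng = np.random.default_rng(seed)
    # Assumed value range; the question does not say how the inputs were drawn.
    values = rng.integers(90, 200, size=(n_rows, n_inputs))
    columns = ["input_{}".format(i + 1) for i in range(n_inputs)]
    df = pd.DataFrame(values, columns=columns)
    # argmax returns the first position of the maximum; add 1 for 1-based "locN" labels.
    df["output"] = ["loc{}".format(i + 1) for i in values.argmax(axis=1)]
    return df


df = make_dummy_data()
print(df.shape)
print(df["output"].value_counts().head())
```

With this rule, ties are resolved in favor of the earliest column, because argmax returns the first maximum it finds.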

I have written the code below to train and test XGBoost.

from __future__ import division
import numpy as np
import pandas as pd
import scipy.sparse
import pickle
import xgboost as xgb
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, LabelBinarizer

df=pd.read_csv("data.txt", sep=',')

# Create training and validation sets
sz = df.shape
train = df.iloc[:int(sz[0] * 0.7), :]
test = df.iloc[int(sz[0] * 0.7):, :]

# Separate X & Y for training
train_X = train.iloc[:, :32].values
train_Y = train.iloc[:, 32].values

# Separate X & Y for test
test_X = test.iloc[:, :32].values
test_Y = test.iloc[:, 32].values

# Get the count of  unique output labels
num_classes = df.output.nunique()

lb = LabelBinarizer()
train_Y = lb.fit_transform(train_Y.tolist())
test_Y = lb.fit_transform(test_Y.tolist())

# Normalize the training data
#train_X -= np.mean(train_X, axis=0)
#train_X /= np.std(train_X, axis=0)
#train_X /= 255

# Normalize the test data
#test_X -= np.mean(test_X, axis=0)
#test_X /= np.std(test_X, axis=0)
#test_X /= 255

xg_train = xgb.DMatrix(train_X, label=train_Y)
xg_test = xgb.DMatrix(test_X, label=test_Y)

# setup parameters for xgboost
param = {}
# use softmax multi-class classification
param['objective'] = 'multi:softmax'
# learning rate (step size shrinkage)
param['eta'] = 0.1
param['max_depth'] = 6
param['silent'] = 1
param['nthread'] = 4
param['num_class'] = num_classes

watchlist = [(xg_train, 'train'), (xg_test, 'test')]
num_round = 5
bst = xgb.train(param, xg_train, num_round, watchlist)
#bst.dump_model('dump.raw.txt')
# get prediction
pred = bst.predict(xg_test)
actual = np.argmax(test_Y, axis=1)
error_rate = np.sum(pred != actual) / test_Y.shape[0]
print('Test error using softmax = {}'.format(error_rate))

# do the same thing again, but output probabilities
param['objective'] = 'multi:softprob'
bst = xgb.train(param, xg_train, num_round, watchlist)
# Note: this convention has been changed since xgboost-unity
# get prediction, this is in 1D array, need reshape to (ndata, nclass)
pred_prob = bst.predict(xg_test).reshape(test_Y.shape[0], num_classes)
pred_label = np.argmax(pred_prob, axis=1)
actual_label = np.argmax(test_Y, axis=1)
error_rate = np.sum(pred_label != actual_label) / test_Y.shape[0]
print('Test error using softprob = {}'.format(error_rate))

However, I observe that it always predicts label 0, i.e. the first index in the one-hot encoded output.

Output:

[0] train-merror:0.11081    test-merror:0.111076
[1] train-merror:0.11081    test-merror:0.111076
[2] train-merror:0.11081    test-merror:0.111076
[3] train-merror:0.111216   test-merror:0.111076
[4] train-merror:0.11081    test-merror:0.111076
Test error using softmax = 0.64846954875355
[0] train-merror:0.11081    test-merror:0.111076
[1] train-merror:0.11081    test-merror:0.111076
[2] train-merror:0.11081    test-merror:0.111076
[3] train-merror:0.111216   test-merror:0.111076
[4] train-merror:0.11081    test-merror:0.111076
Test error using softprob = 0.64846954875355

Prediction:

pred_prob[0:10]
array([[0.34024397, 0.10218474, 0.07965304, 0.07965304, 0.07965304,
        0.07965304, 0.07965304, 0.07965304, 0.07965304],
       [0.34009758, 0.10257103, 0.07961877, 0.07961877, 0.07961877,
        0.07961877, 0.07961877, 0.07961877, 0.07961877],
       [0.34421352, 0.09171014, 0.08058234, 0.08058234, 0.08058234,
        0.08058234, 0.08058234, 0.08058234, 0.08058234],
       [0.33950377, 0.10413795, 0.07947975, 0.07947975, 0.07947975,
        0.07947975, 0.07947975, 0.07947975, 0.07947975],
       [0.3426607 , 0.09580766, 0.08021881, 0.08021881, 0.08021881,
        0.08021881, 0.08021881, 0.08021881, 0.08021881],
       [0.33777002, 0.10427278, 0.07970817, 0.07970817, 0.07970817,
        0.07970817, 0.07970817, 0.07970817, 0.07970817],
       [0.33733884, 0.10985068, 0.07897293, 0.07897293, 0.07897293,
        0.07897293, 0.07897293, 0.07897293, 0.07897293],
       [0.33953893, 0.10404517, 0.07948799, 0.07948799, 0.07948799,
        0.07948799, 0.07948799, 0.07948799, 0.07948799],
       [0.33987975, 0.10314585, 0.07956778, 0.07956778, 0.07956778,
        0.07956778, 0.07956778, 0.07956778, 0.07956778],
       [0.34013695, 0.10246711, 0.07962799, 0.07962799, 0.07962799,
        0.07962799, 0.07962799, 0.07962799, 0.07962799]], dtype=float32)

Whatever accuracy I get comes from predicting label 0, which makes up around 35% of the data.

Is my expectation correct here? Are there too many input features and too little data for it to learn properly?

Full code: Here

Test Data: Here

You can see from the training dump that the loss does not change between iterations/trees. This could be due to the limited amount of data, as you pointed out. But this is very easy and fast to check: just increase the length in the slicing. - Mischa Lisovyi
Could you elaborate on what exactly you mean by increasing the slicing length? Increasing the size of the validation set? - Nikhil Utane
Sorry, I got confused by the 32 in the slicing. Now I get it. My proposal was to use a larger training set (since it seems that xgboost cannot find any pattern in the data). - Mischa Lisovyi
Thanks. I will first try using fewer parameters, say the top 3 values, just to see if that works as expected. - Nikhil Utane
If I use LabelEncoder(), then it works fine. But if I use LabelBinarizer(), it always predicts 0. Am I missing something here? - Nikhil Utane
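
The difference described in the last comment can be seen without xgboost at all. The sketch below uses a small made-up label array (an assumption for illustration, not the question's data) to compare what the two encoders return. The multi:softmax and multi:softprob objectives expect one integer class index in the range 0 to num_class - 1 per row, which is the form LabelEncoder produces; the one-hot matrix from LabelBinarizer is not in that form, which is consistent with the behavior reported above.

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer, LabelEncoder

# Made-up labels for illustration only.
labels = np.array(["loc12", "loc27", "loc3", "loc12", "loc27"])

# LabelEncoder: one integer class index per row (1-D array).
encoded = LabelEncoder().fit_transform(labels)

# LabelBinarizer: one row per sample, one column per class (2-D one-hot matrix).
one_hot = LabelBinarizer().fit_transform(labels)

print(encoded.shape)  # (5,)
print(one_hot.shape)  # (5, 3)

# Taking argmax over the one-hot columns recovers the integer indices.
recovered = np.argmax(one_hot, axis=1)
print(np.array_equal(recovered, encoded))  # True
```

When the labels are split into training and test sets, it is usually safer to fit the encoder once and call transform on the test labels, so that both sets share the same class-to-index mapping.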

1 Answer

For anyone else who runs into this issue as I did, check the num_boost_round parameter of xgb.train. Make sure it is equal to, or close to, the number of rounds you used with xgb.cv. I think the problem is that the model has not been trained enough, i.e. training stopped too early.