I am working on a problem where I need to predict an output label based on certain input values. Since I do not have the real data yet, I am creating some dummy data so that I can have my code ready by the time I get the data. Below is what the sample data looks like: there are a number of input columns, and the last column, 'output', is the output label to be predicted.
input_1,input_2,input_3,input_4,input_5,input_6,input_7,input_8,input_9,input_10,input_11,input_12,input_13,input_14,input_15,input_16,input_17,input_18,input_19,input_20,input_21,input_22,input_23,input_24,input_25,input_26,input_27,input_28,input_29,input_30,input_31,input_32,output
0.0,97.0,155,143,98,145,102,102,144,100,96,193,90,98,98,122,101,101,101,98,99,96,118,148,98,99,112,94,98,100,96.0,95,loc12
96.0,94.0,116,99,98,105,95,101,168,101,96,108,95,98,98,96,102,98,98,99,98,98,132,150,102,101,195,104,96,97,93.0,98,loc27
Since this is dummy data, I am setting the output label to the input that has the maximum value. For example, in the first row the maximum value (193) is at the 12th input, so the output is set to loc12. My expectation is that XGBoost should learn this rule on its own and predict the output label correctly.
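For reference, here is a minimal sketch of how such dummy data could be generated; the value range and row count are my own assumptions for illustration, not taken from the real file:
import numpy as np
import pandas as pd

n_rows, n_inputs = 10000, 32
rng = np.random.RandomState(42)
# Values roughly in the range seen in the sample rows above
X = rng.randint(90, 200, size=(n_rows, n_inputs)).astype(float)
cols = ['input_{}'.format(i + 1) for i in range(n_inputs)]
dummy = pd.DataFrame(X, columns=cols)
# The output label is the 1-based position of the row maximum, e.g. 'loc12'
dummy['output'] = ['loc{}'.format(i + 1) for i in np.argmax(X, axis=1)]
dummy.to_csv("data.txt", index=False)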
I have written the code below to train and test XGBoost.
from __future__ import division
import numpy as np
import pandas as pd
import scipy.sparse
import pickle
import xgboost as xgb
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, LabelBinarizer
df=pd.read_csv("data.txt", sep=',')
# Create training and validation sets
sz = df.shape
train = df.iloc[:int(sz[0] * 0.7), :]
test = df.iloc[int(sz[0] * 0.7):, :]
# Separate X & Y for training
train_X = train.iloc[:, :32].values
train_Y = train.iloc[:, 32].values
# Separate X & Y for test
test_X = test.iloc[:, :32].values
test_Y = test.iloc[:, 32].values
# Get the count of unique output labels
num_classes = df.output.nunique()
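# One-hot encode the output labels (LabelBinarizer returns an (n_samples, n_classes) 0/1 indicator matrix)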
lb = LabelBinarizer()
train_Y = lb.fit_transform(train_Y.tolist())
test_Y = lb.transform(test_Y.tolist())  # reuse the encoder fitted on the training labels
# Normalize the training data
#train_X -= np.mean(train_X, axis=0)
#train_X /= np.std(train_X, axis=0)
#train_X /= 255
# Normalize the test data
#test_X -= np.mean(test_X, axis=0)
#test_X /= np.std(test_X, axis=0)
#test_X /= 255
xg_train = xgb.DMatrix(train_X, label=train_Y)
xg_test = xgb.DMatrix(test_X, label=test_Y)
# setup parameters for xgboost
param = {}
# use softmax multi-class classification
param['objective'] = 'multi:softmax'
# learning rate (step size shrinkage)
param['eta'] = 0.1
param['max_depth'] = 6
param['silent'] = 1
param['nthread'] = 4
param['num_class'] = num_classes
watchlist = [(xg_train, 'train'), (xg_test, 'test')]
num_round = 5
bst = xgb.train(param, xg_train, num_round, watchlist)
#bst.dump_model('dump.raw.txt')
# get prediction
pred = bst.predict(xg_test)
actual = np.argmax(test_Y, axis=1)
error_rate = np.sum(pred != actual) / test_Y.shape[0]
print('Test error using softmax = {}'.format(error_rate))
# do the same thing again, but output probabilities
param['objective'] = 'multi:softprob'
bst = xgb.train(param, xg_train, num_round, watchlist)
# Note: this convention has been changed since xgboost-unity
# get prediction, this is in 1D array, need reshape to (ndata, nclass)
pred_prob = bst.predict(xg_test).reshape(test_Y.shape[0], num_classes)
pred_label = np.argmax(pred_prob, axis=1)
actual_label = np.argmax(test_Y, axis=1)
error_rate = np.sum(pred_label != actual_label) / test_Y.shape[0]
print('Test error using softprob = {}'.format(error_rate))
However, I am observing that it always predicts label 0, i.e. the first index in the one-hot encoded output.
Output:
[0] train-merror:0.11081 test-merror:0.111076
[1] train-merror:0.11081 test-merror:0.111076
[2] train-merror:0.11081 test-merror:0.111076
[3] train-merror:0.111216 test-merror:0.111076
[4] train-merror:0.11081 test-merror:0.111076
Test error using softmax = 0.64846954875355
[0] train-merror:0.11081 test-merror:0.111076
[1] train-merror:0.11081 test-merror:0.111076
[2] train-merror:0.11081 test-merror:0.111076
[3] train-merror:0.111216 test-merror:0.111076
[4] train-merror:0.11081 test-merror:0.111076
Test error using softprob = 0.64846954875355
Prediction:
pred_prob[0:10]
array([[0.34024397, 0.10218474, 0.07965304, 0.07965304, 0.07965304,
0.07965304, 0.07965304, 0.07965304, 0.07965304],
[0.34009758, 0.10257103, 0.07961877, 0.07961877, 0.07961877,
0.07961877, 0.07961877, 0.07961877, 0.07961877],
[0.34421352, 0.09171014, 0.08058234, 0.08058234, 0.08058234,
0.08058234, 0.08058234, 0.08058234, 0.08058234],
[0.33950377, 0.10413795, 0.07947975, 0.07947975, 0.07947975,
0.07947975, 0.07947975, 0.07947975, 0.07947975],
[0.3426607 , 0.09580766, 0.08021881, 0.08021881, 0.08021881,
0.08021881, 0.08021881, 0.08021881, 0.08021881],
[0.33777002, 0.10427278, 0.07970817, 0.07970817, 0.07970817,
0.07970817, 0.07970817, 0.07970817, 0.07970817],
[0.33733884, 0.10985068, 0.07897293, 0.07897293, 0.07897293,
0.07897293, 0.07897293, 0.07897293, 0.07897293],
[0.33953893, 0.10404517, 0.07948799, 0.07948799, 0.07948799,
0.07948799, 0.07948799, 0.07948799, 0.07948799],
[0.33987975, 0.10314585, 0.07956778, 0.07956778, 0.07956778,
0.07956778, 0.07956778, 0.07956778, 0.07956778],
[0.34013695, 0.10246711, 0.07962799, 0.07962799, 0.07962799,
0.07962799, 0.07962799, 0.07962799, 0.07962799]], dtype=float32)
Whatever accuracy I am getting comes from it predicting label 0, which makes up around 35% of the data.
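For reference, the spread of predicted labels versus the label frequencies in the data can be checked with something along these lines (reusing the variables defined in the code above):
# Count how often each class index is predicted on the test set
unique_pred, counts = np.unique(pred_label, return_counts=True)
print(dict(zip(unique_pred.tolist(), counts.tolist())))
# Compare against the relative frequency of each output label in the full data
print(df['output'].value_counts(normalize=True))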
Is my expectation correct here? Are there too many input features and too little data for the model to learn properly?
Full code: Here
Test Data: Here
Comment (Mischa Lisovyi): My proposal was to get a larger training set, since it seems that xgboost cannot find any pattern in the data.