I'm trying to predict the number of updates ('sys_mod_count') from the text description ('eng').

I have binarized 'sys_mod_count' into two classes: 1 if the value is >= 17, and 0 if it is < 17.

But I want to remove this condition, because this value is not available at decision time in the real world.

I'm thinking of using a decision tree or random forest to train the classifier on the feature set.


import pandas as pd
from sklearn import model_selection, preprocessing, linear_model, naive_bayes, metrics, svm
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

def train_model(classifier, feature_vector_train, label, feature_vector_valid, is_neural_net=False):
    # note: is_neural_net is currently unused
    # fit the training dataset on the classifier
    classifier.fit(feature_vector_train, label)
    # predict the labels on the validation dataset
    predictions = classifier.predict(feature_vector_valid)
    # the caller computes accuracy with metrics.accuracy_score(valid_y, predictions)
    return predictions

df_3 = pd.read_csv('processedData.csv', sep=";")
# .copy() avoids pandas' SettingWithCopyWarning when the new column is added below
st_new = df_3[['sys_mod_count', 'eng', 'ger']].copy()
st_new['updates_binary'] = st_new['sys_mod_count'].apply(lambda x: 1 if x >= 17 else 0)
st_org = st_new[['eng', 'updates_binary']]
st_org = st_org.dropna(axis=0, subset=['eng'])  # drop rows where 'eng' is missing
train_x, valid_x, train_y, valid_y = model_selection.train_test_split(st_org['eng'], st_org['updates_binary'], stratify=st_org['updates_binary'], test_size=0.20)
tfidf_vect = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', max_features=5000)
# fit on the training split only, so the validation text does not leak into the vocabulary
tfidf_vect.fit(train_x)
xtrain_tfidf = tfidf_vect.transform(train_x)
xvalid_tfidf = tfidf_vect.transform(valid_x)

# Naive Bayes on Word Level TF IDF Vectors
predictions = train_model(naive_bayes.MultinomialNB(), xtrain_tfidf, train_y, xvalid_tfidf)
print("NB, WordLevel TF-IDF: ", metrics.accuracy_score(valid_y, predictions))


It is not clear what your question is. - Abhineet Gupta
@AbhineetGupta I want to let the classifier decide the 'updates_binary' value with a decision tree or random forest method, rather than predefining this value as in the Naive Bayes example above. - Adolf

1 Answer

This seems to be a threshold-setting problem: you would like to choose the threshold at which a certain classification is made. No supervised classifier can set the threshold for you. Without training data that already has binary classes you cannot train the classifier, and to create that training data you need to set the threshold to begin with. It's a chicken-and-egg problem.

If you have some way of identifying which binary label is correct, then you can vary the threshold and measure the errors, similar to how it's suggested here. Then you can either run a classifier on the binary labels produced by that threshold, or run a regressor on sys_mod_count and convert its output to binary using the identified threshold.
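
As a rough illustration of varying the threshold and measuring errors, here is a minimal sketch in plain Python. The data and the names `trusted_labels`, `binarize`, `error_rate` and `best_threshold` are hypothetical and only for illustration; it assumes you already have a small set of rows where the correct binary label is known from some other source.

```python
def binarize(counts, threshold):
    """Label a count 1 if it is >= threshold, else 0 (same rule as in the question)."""
    return [1 if c >= threshold else 0 for c in counts]


def error_rate(predicted, actual):
    """Fraction of positions where the two label lists disagree."""
    wrong = sum(p != a for p, a in zip(predicted, actual))
    return wrong / len(actual)


def best_threshold(counts, trusted_labels, candidates):
    """Return (threshold, error) with the lowest error; ties go to the smallest threshold."""
    scored = [(error_rate(binarize(counts, t), trusted_labels), t) for t in candidates]
    err, t = min(scored)
    return t, err


# Hypothetical toy data: sys_mod_count values and labels known to be correct.
counts = [3, 5, 8, 12, 16, 18, 21, 25, 30, 40]
trusted_labels = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]

threshold, err = best_threshold(counts, trusted_labels, range(1, 41))
print(threshold, err)

# If you train a regressor on sys_mod_count instead, its predicted counts
# (made-up numbers here) can be converted with the same threshold.
predicted_counts = [4.2, 19.7]
print(binarize(predicted_counts, threshold))
```

On real data several thresholds will often give similar error, so it may be worth looking at the whole error curve rather than only the minimum.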

The above approach does not work if you have no way to identify what the correct binary label should be. In that case, the problem you are trying to solve is finding a boundary between points based on the value of your sys_mod_count variable. This is unsupervised learning, so techniques like clustering will be helpful here. You can cluster your data into two clusters based on the distance of points from each other, and then label each cluster, which becomes your binary label.
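
A minimal sketch of the clustering idea, assuming you cluster on sys_mod_count alone: a hand-written two-cluster k-means in one dimension, in plain Python. The function name `two_means_1d` and the toy counts are made up for illustration; with scikit-learn, `sklearn.cluster.KMeans(n_clusters=2)` would play the same role.

```python
def two_means_1d(values, iterations=100):
    """Split 1-D values into two clusters with k-means (k=2).

    Returns (labels, (low_center, high_center)); label 0 is the low cluster.
    """
    lo, hi = float(min(values)), float(max(values))  # start centers at the extremes
    for _ in range(iterations):
        low = [v for v in values if abs(v - lo) <= abs(v - hi)]
        high = [v for v in values if abs(v - lo) > abs(v - hi)]
        if not high:  # all values identical: nothing to split
            break
        new_lo, new_hi = sum(low) / len(low), sum(high) / len(high)
        if (new_lo, new_hi) == (lo, hi):  # centers stopped moving
            break
        lo, hi = new_lo, new_hi
    labels = [0 if abs(v - lo) <= abs(v - hi) else 1 for v in values]
    return labels, (lo, hi)


# Hypothetical sys_mod_count values.
counts = [2, 3, 5, 4, 30, 28, 32, 1]
labels, centers = two_means_1d(counts)
boundary = sum(centers) / 2  # midpoint between the two cluster centers

print(labels)    # the binary label for each row
print(boundary)  # the threshold implied by the clustering
```

The resulting labels can then be used as the target for a classifier on the text features. Whether two clusters are a meaningful split depends on how sys_mod_count is actually distributed, so it is worth plotting a histogram first.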