0
votes

I have a pd.DataFrame with the columns:

index, text, label, ID

The ID is constructed in a specific way, so that the texts can be assigned to groups by ID. I preprocessed the text, built a Pipeline, ran sklearn.model_selection.train_test_split, and fitted a grid search (gs.fit(x_train, y_train)). Now I have my original dataset, x_train, x_test, y_train, y_test and y_pred, where y_pred = gs_fit.best_estimator_.predict(x_test).

Here's what I want:

I now want to find the IDs that correspond to the entries of y_pred, to see whether some ID groups were predicted better than others.

My problem is that I no longer have the index and the text has changed after preprocessing, so I am not sure how to find out which y_pred belongs to which ID. Any ideas how to do this?

1 Answer

1
votes

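The index is preserved: when you pass pandas objects to train_test_split, each returned piece keeps the original DataFrame index, so X_test.index equals y_test.index.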

# assumes X and y are taken from the same DataFrame and the index is not reset
from sklearn.model_selection import train_test_split

X, y = df[['feature_1', 'feature_2']], df[['target']]

# pass either train_size or test_size
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=.8, random_state=42)

# the split keeps the original index, so X_test and y_test stay aligned
print(all(X_test.index == y_test.index)) # True

So just use the index to look up the ID. The entries of y_pred come out in the same order as X_test.index.

# return only test IDs and their predictions
results = df.loc[X_test.index, ['ID']].copy()
results['y_pred'] = y_pred
print(results)
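
To answer the "which groups were predicted better" part, the lookup above can be extended with a per-group accuracy. The following is only a minimal sketch: the toy data, the TfidfVectorizer/LogisticRegression pipeline, and the rule that the first character of the ID defines the group are all assumptions made for illustration, since the question does not say how the ID encodes the group.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Toy data, made up for illustration only.
df = pd.DataFrame({
    "text": ["good film", "bad film", "great plot", "awful plot",
             "nice cast", "poor cast", "fine score", "weak score"],
    "label": [1, 0, 1, 0, 1, 0, 1, 0],
    "ID": ["A1", "A2", "A3", "A4", "B1", "B2", "B3", "B4"],
})

# Splitting pandas objects keeps the original index on every piece.
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["label"], test_size=0.5,
    stratify=df["label"], random_state=42)

# Stand-in for the fitted grid search / pipeline from the question.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Look up the IDs of the test rows by index, then attach the predictions.
results = df.loc[X_test.index, ["ID"]].copy()
results["y_true"] = y_test    # aligned by index
results["y_pred"] = y_pred    # positional, same order as X_test
results["correct"] = results["y_true"] == results["y_pred"]

# Assumption: the first character of the ID is the group.
results["group"] = results["ID"].str[0]
per_group = results.groupby("group")["correct"].agg(["mean", "count"])
print(per_group)
```

The "mean" column is the accuracy within each group and "count" shows how many test rows back it up, which matters when some groups are small.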

If the index is lost because you use a data structure other than pandas, another trick is to prepend the ID to the text, e.g. AC123 | text input here. Your preprocessing function then has to skip everything before the |, apply the text preprocessing only to the part after it, and use the first part to look up the ID.
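
A rough sketch of that prepend-and-split trick in plain Python follows. The helper names attach_id, split_id and preprocess, the " | " separator, and the sample IDs and texts are all hypothetical; the lowercase/strip step only stands in for whatever preprocessing you actually use.

```python
SEP = " | "  # assumed separator; it must not occur inside an ID

def attach_id(doc_id, text):
    """Prepend the ID so it travels with the text through any container."""
    return f"{doc_id}{SEP}{text}"

def split_id(tagged):
    """Split at the first separator into (ID, raw text)."""
    doc_id, _, text = tagged.partition(SEP)
    return doc_id, text

def preprocess(text):
    """Placeholder for the real text preprocessing."""
    return text.lower().strip()

ids = ["AC123", "AC456"]
texts = ["Text input HERE", "Another | piped text "]

# Tag once, before any step that would drop the index.
tagged = [attach_id(i, t) for i, t in zip(ids, texts)]

# Later: recover the ID and preprocess only the part after the separator.
recovered = [split_id(t) for t in tagged]
ids_out = [doc_id for doc_id, _ in recovered]
cleaned = [preprocess(text) for _, text in recovered]

print(ids_out)
print(cleaned)
```

Because partition splits at the first separator only, a | inside the text itself is left alone.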