I have a pd.DataFrame named dataset with the columns index, text, label, and ID.
The IDs are structured so that the texts can be assigned to groups by ID.
I preprocessed the text, built a Pipeline, split the data with sklearn.model_selection.train_test_split, and ran a grid search with gs.fit(x_train, y_train).
Now I have my original dataset, x_train, x_test, y_train, y_test, and y_pred, where y_pred = gs_fit.best_estimator_.predict(x_test).
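For reference, here is a minimal sketch of roughly this setup. The toy data, the lowercasing step, the TfidfVectorizer/LogisticRegression pipeline, and the parameter grid are all placeholders assumed for illustration, not the actual code:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# Hypothetical toy data with the columns described above
# (the default RangeIndex plays the role of the "index" column).
dataset = pd.DataFrame({
    "text": [
        "Good movie", "Great film", "Loved this movie", "Really good film",
        "Great acting", "Loved the story", "Bad movie", "Terrible film",
        "Hated this movie", "Really bad film", "Terrible acting", "Hated the story",
    ],
    "label": [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    "ID": ["A1", "A2", "B1", "B2", "A3", "B3", "A4", "B4", "A5", "B5", "A6", "B6"],
})

# Placeholder preprocessing: lowercase the text.
x = dataset["text"].str.lower()
y = dataset["label"]

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, stratify=y, random_state=0
)

# Assumed pipeline and parameter grid, for illustration only.
pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression()),
])
gs = GridSearchCV(pipe, param_grid={"clf__C": [0.1, 1.0]}, cv=2)

gs_fit = gs.fit(x_train, y_train)
y_pred = gs_fit.best_estimator_.predict(x_test)
```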
Here's what I want:
I now want to find the ID that corresponds to each entry in y_pred, to see whether some ID groups were predicted better than others.
My problem is that I no longer have the index, and the text has changed during preprocessing, so I am not sure how to tell which y_pred belongs to which ID. Any ideas how to do this?