I have split my data into train and test parts. My data table has a 'text' column and ten other columns holding numerical features. I used TfidfVectorizer on the training data to generate the term matrix, then combined it with the numerical features to create the training dataframe.

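import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(use_idf=True, max_features=5000, max_df=0.95)
tfidf_vectorizer_train = tfidf_vectorizer.fit_transform(X_train['text'].values)
# get_feature_names() was removed in newer scikit-learn releases; get_feature_names_out() replaces it
df1_tfidf_train = pd.DataFrame(tfidf_vectorizer_train.toarray(), columns=tfidf_vectorizer.get_feature_names_out())
# collect numerical features; reset the index so rows line up with df1_tfidf_train in concat
df2_train = df_main_ques.iloc[train_index][traffic_metrics].reset_index(drop=True)
df_combined_train = pd.concat([df1_tfidf_train, df2_train], axis=1)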

To calculate the tf-idf scores for the test part, I need to reuse the vectorizer fitted on the training data set. I am not sure how to generate the test dataframe. Related posts:

[1] Append tfidf to pandas dataframe: discusses only how to create the training dataset.

[2] How does TfidfVectorizer compute scores on test data: discusses the test data, but it is not clear how to generate a test dataframe that contains both the terms and the numerical features.

1 Answer

You can call the transform method of the already fitted vectorizer on your test data. This reuses the vocabulary and IDF weights learned from the training set to generate the TF-IDF scores for the test set:

tfidf_vectorizer_test = tfidf_vectorizer.transform(X_test['text'].values)