I have separated my data into train and test parts. My data table has a 'text' column. Consider that I have ten other columns representing numerical features. I have used TfidfVectorizer and the training data to generate term matrix and combine that with numerical features to create the training dataframe.
tfidf_vectorizer=TfidfVectorizer(use_idf=True, max_features=5000, max_df=0.95)
tfidf_vectorizer_train = tfidf_vectorizer.fit_transform(X_train['text'].values)
df1_tfidf_train = pd.DataFrame(tfidf_vectorizer_train.toarray(), columns=tfidf_vectorizer.get_feature_names())
df2_train = df_main_ques.iloc[train_index][traffic_metrics]#to collect numerical features
df_combined_train = pd.concat([df1_tfidf_train, df2_train], axis=1)
To calculate the tf-idf score for test part, I need to reuse the training data set. I am not sure how to generate the test data part. Related post:
[1]Append tfidf to pandas dataframe: discuss only creating training dataset part
[2]How does TfidfVectorizer compute scores on test data: Discussed test data part but it is not clear how to generate the test dataframe that contains both terms and numerical features.