I just started using feature selection on my dataset and came across the SelectFromModel class, which automatically transforms an original n x m feature matrix into an n x k matrix where k << m. However, k is not known a priori.
I wonder how I should use this to train a model and then use the trained model to predict on new data. As we know, training and test instances must be represented as feature vectors of the same dimension.
But this dimension depends on the data and cannot be controlled with SelectFromModel.
My code is shown below:
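from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

X_train = ... # feature matrix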
print("BEFORE FEATURE SELECTION, FEATURE MATRIX shape={}".format(X_train.shape))
# output of this line is 24771, 11680
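# solver='liblinear' is set explicitly because the default solver in recent scikit-learn versions does not support the l1 penalty
select = SelectFromModel(LogisticRegression(class_weight='balanced',penalty="l1",C=0.01,solver='liblinear'))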
X_train = select.fit_transform(X_train, y_train)
print("AFTER FEATURE SELECTION, FEATURE MATRIX shape={}".format(X_train.shape))
# output of this line is 24771, 170
During testing, the pre-trained model is loaded, and new data instances need to be represented in the same feature vector:
X_test = ... # feature matrix
# the next line maps test set features to the feature vectors observed on training data, using corresponding vocabularies
X_test = map_test_to_train_featurevectors(X_test, X_train)
print("BEFORE FEATURE SELECTION, FEATURE MATRIX shape={}".format(X_test.shape))
# output of this line is 550, 11680, so test instances have the same vector dimension as training instances
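# solver='liblinear' is set explicitly because the default solver in recent scikit-learn versions does not support the l1 penalty
select = SelectFromModel(LogisticRegression(class_weight='balanced',penalty="l1",C=0.01,solver='liblinear'))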
X_test = select.fit_transform(X_test, y_test)
print("AFTER FEATURE SELECTION, FEATURE MATRIX shape={}".format(X_test.shape))
# output of this line is 550, 5, but the pre-trained model will expect 170
best_estimator = util.load_classifier_model(model_file)
prediction_dev = best_estimator.predict_proba(X_test)
The last line obviously raises the following error, because the feature matrices produced by feature selection during training and testing have different dimensions:
ValueError: X has 5 features per sample; expecting 169
Does it mean that you cannot use SelectFromModel in this way? Can it only be used for training and evaluation?
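For comparison, here is a minimal sketch of the alternative to refitting on the test set: fit the selector once on the training data and call only `transform` on new data, so the same k columns are kept in both cases. It assumes the fitted selector is still available at test time. The synthetic data from `make_classification`, the value `C=0.1`, and the names `X_train_sel`, `X_test_sel`, and `clf` are placeholders for illustration, not the real setup above.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real feature matrices (an assumption for illustration).
X, y = make_classification(
    n_samples=600,
    n_features=200,
    n_informative=10,
    n_clusters_per_class=1,
    random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit the selector on the training data only; this is where k gets decided.
select = SelectFromModel(
    LogisticRegression(
        class_weight="balanced", penalty="l1", C=0.1, solver="liblinear"
    )
)
X_train_sel = select.fit_transform(X_train, y_train)

# Reuse the already fitted selector on the test data: transform, not fit_transform.
# The test labels are not needed here, and the same columns are kept.
X_test_sel = select.transform(X_test)

print("train:", X_train_sel.shape, "test:", X_test_sel.shape)

# Train the final classifier on the reduced training matrix and predict on
# the reduced test matrix; both now have the same number of columns.
clf = LogisticRegression(max_iter=1000).fit(X_train_sel, y_train)
proba = clf.predict_proba(X_test_sel)
```

If the classifier is saved to disk and loaded later, as in the snippet above, the fitted selector would presumably need to be saved and loaded with it (for instance by wrapping both in a scikit-learn `Pipeline`), since the selected columns are part of what was learned during training.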