
I just started to use feature selection on my dataset and I came across the SelectFromModel module, which automatically transformed an original n x m feature matrix into n x k, where k << m. However, k is unknown a priori.

I wonder how I should use this to train a model and then use the trained model to predict on new data. As we know, training and test instances must be represented as feature vectors of the same dimension.

But this dimension will depend on the data, and cannot be controlled with SelectFromModel.

I have written the code presented below:

X_train = ... # feature matrix

print("BEFORE FEATURE SELECTION, FEATURE MATRIX shape={}".format(X_train.shape))
# output of this line is 24771, 11680

select = SelectFromModel(LogisticRegression(class_weight='balanced', penalty="l1", C=0.01, solver="liblinear"))  # liblinear supports the l1 penalty
X_train = select.fit_transform(X_train, y_train)

print("AFTER FEATURE SELECTION, FEATURE MATRIX shape={}".format(X_train.shape))
# output of this line is 24771, 170

During testing, the pre-trained model is loaded, and new data instances need to be represented as feature vectors of the same dimension:

X_test = ... # feature matrix

# the next line maps test set features to the feature vectors observed on training data, using corresponding vocabularies

X_test = map_test_to_train_featurevectors(X_test, X_train)

print("BEFORE FEATURE SELECTION, FEATURE MATRIX shape={}".format(X_test.shape))
# output of this line is 550, 11680, so test instances have the same vector dimension as training instances

select = SelectFromModel(LogisticRegression(class_weight='balanced', penalty="l1", C=0.01, solver="liblinear"))
X_test = select.fit_transform(X_test, y_test)

print("AFTER FEATURE SELECTION, FEATURE MATRIX shape={}".format(X_test.shape))
# output of this line is 550, 5, but the pre-trained model will expect 170

best_estimator = util.load_classifier_model(model_file)
prediction_dev = best_estimator.predict_proba(X_test)

The last line obviously generates the following error, because the feature matrices resulting from feature selection during training and testing have different dimensions:

ValueError: X has 5 features per sample; expecting 169

Does it mean that you cannot use SelectFromModel in this way? Can it only be used for training and evaluation?


1 Answer


You cannot use the SelectFromModel() module in this way. It does not simply reduce the dimensionality of the data; by contrast, PCA().fit_transform(data) does, and will work the way you want. The documentation for SelectFromModel() says the following:

Meta-transformer for selecting features based on importance weights.

In other words, you pass a model of your choice to SelectFromModel() and fit it on the input data. Once fitted, it keeps the most viable features based on the importance of the model's weights. PCA(), in contrast, works with the explained variance of the data: it strives to preserve as much of that variance as possible while reducing the dimensionality. It is very handy since you can directly set the fraction of the original variance you want to retain.
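To illustrate that last point, here is a minimal sketch on synthetic data (the 0.95 variance ratio is an arbitrary choice, not something from the question):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.rand(100, 20)  # 100 samples, 20 features

# n_components in (0, 1) means: keep enough components to explain
# at least that fraction of the total variance.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)  # (100, k) with k chosen automatically, k <= 20
```

Note that the number of output columns k is chosen by PCA itself, just as SelectFromModel chooses k implicitly, but here you control the trade-off through the variance ratio.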

In your test code, however, you reduce the dimensionality of the test data by fitting another LogisticRegression(). It calculates its own weights for that set, which is entirely different from the training set, and therefore picks the most important features according to weights learned on the test set, not the training set.


There are two approaches you could take to solve the problem:

  • Observe which 170 features were selected by select.fit_transform(X_train, y_train) (the boolean mask is available via select.get_support()) and keep exactly those columns of X_test, e.g. by calling select.transform(X_test) on the selector fitted on the training data. That way you avoid the dimension mismatch.

  • Implement a dimensionality reduction method, such as PCA or SVD, fitted on the training data and then applied to the test data.
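A minimal sketch of the first approach, with synthetic data standing in for your real feature matrices (the shapes and hyperparameters here are illustrative, not your actual ones):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

# Stand-in data: 50 features, same dimension for train and test
X_train, y_train = make_classification(n_samples=200, n_features=50, random_state=0)
X_test, _ = make_classification(n_samples=50, n_features=50, random_state=1)

# Fit the selector ONCE, on the training data only
select = SelectFromModel(
    LogisticRegression(penalty="l1", C=1.0, solver="liblinear"))
X_train_sel = select.fit_transform(X_train, y_train)

# Reuse the fitted selector: transform, NOT fit_transform
X_test_sel = select.transform(X_test)

# Both matrices now have the same number of columns
print(X_train_sel.shape[1] == X_test_sel.shape[1])

# The boolean mask of kept columns, if you want to inspect them
mask = select.get_support()
```

The key difference from your code is the single call to fit_transform: the test data is only ever passed through transform, so the column selection learned on the training set is reused unchanged.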

I hope that helps!