I'm trying to fit a linear regression model using a greedy feature selection algorithm. To be a bit more specific, I have four sets of data: X_dev, y_dev, X_test, and y_test, the first two being the features and labels for the training set and the latter two for the test set. The shapes of the arrays are (900, 126), (900,), (100, 126), and (100,), respectively.
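For context, the four arrays come from np.load calls; the filenames below are placeholders, not my actual paths:

import numpy as np

# placeholder filenames; shapes as described above
X_dev = np.load("X_dev.npy")    # (900, 126)
y_dev = np.load("y_dev.npy")    # (900,)
X_test = np.load("X_test.npy")  # (100, 126)
y_test = np.load("y_test.npy")  # (100,)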
What I mean by "greedy feature selection" is that I would first fit 126 models using one feature each from the X_dev set, choose the best one, then fit 125 two-feature models pairing the chosen feature with each of the remaining 125 features. The selection continues until I have obtained the 100 features that perform best among the original 126. A rough sketch of the loop I have in mind is below.
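Here is that sketch, just to show my intent (assuming sklearn's LinearRegression and scoring by R^2 on the dev set; the function name and the scoring choice are mine, not anything official):

import numpy as np
from sklearn.linear_model import LinearRegression

def greedy_forward_selection(X, y, n_features=100):
    # indices of the features chosen so far, and those still available
    selected, remaining = [], list(range(X.shape[1]))
    for _ in range(n_features):
        best_score, best_j = -np.inf, None
        for j in remaining:
            cols = selected + [j]  # try adding feature j to the current set
            model = LinearRegression().fit(X[:, cols], y)
            score = model.score(X[:, cols], y)  # R^2 on the dev set
            if score > best_score:
                best_score, best_j = score, j
        selected.append(best_j)
        remaining.remove(best_j)
    return selected

# intended usage: selected = greedy_forward_selection(X_dev, y_dev)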
The problem I'm facing is with the implementation in Python. The code that I have fits a single feature first:

from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(X_dev[:, 0].reshape(-1, 1), y_dev)  # fit on one column: shape (900, 1)
lin_pred = lin_reg.predict(X_test)              # X_test is still (100, 126)

Because the dimensions don't match ((100, 126) against the single fitted coefficient of shape (1,)), I'm getting a dimension mismatch error.
How should I fix this? I'm trying to evaluate how the model performs when using just that single feature.
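My guess is that predict needs to see the same single column that fit saw, something like the line below, but I'm not sure it's the right approach:

# slice the same column from the test set so the shapes agree: (100, 1)
lin_pred = lin_reg.predict(X_test[:, 0].reshape(-1, 1))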
Thank you.
I do have the import statements for NumPy and sklearn along with the np.load statements for the data. I'll take your advice though and look for an implementation. – Sean