
I'm trying to fit a linear regression model using a greedy feature selection algorithm. To be a bit more specific, I have four sets of data:

X_dev, y_dev, X_test, y_test: the first two are the features and labels for the training set, the latter two for the test set. Their shapes are (900, 126), (900,), (100, 126), and (100,), respectively.

What I mean by "greedy feature selection" is this: first fit 126 models using one feature each from X_dev and keep the best one; then fit 125 two-feature models pairing the kept feature with each of the remaining 125 features, and so on. The selection continues until I have the 100 best-performing features out of the original 126.
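The loop described above can be sketched roughly like this (not code from the post; it assumes mean squared error on the held-out set as the selection criterion, since the question doesn't name a metric, and greedy_forward_selection is a made-up helper name):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

def greedy_forward_selection(X_dev, y_dev, X_test, y_test, n_select=100):
    """Greedy forward selection: at each step, add the feature that gives
    the lowest held-out MSE when combined with the already-chosen set."""
    selected = []
    remaining = list(range(X_dev.shape[1]))
    for _ in range(n_select):
        best_err, best_j = None, None
        for j in remaining:
            cols = selected + [j]
            model = LinearRegression().fit(X_dev[:, cols], y_dev)
            err = mean_squared_error(y_test, model.predict(X_test[:, cols]))
            if best_err is None or err < best_err:
                best_err, best_j = err, j
        selected.append(best_j)
        remaining.remove(best_j)
    return selected
```

With X_dev of shape (900, 126) this runs on the order of 126 + 125 + ... fits, so it is slow but feasible for a one-off experiment.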

The problem I'm facing is regarding implementation in Python. The code that I have is for fitting a single feature first:

lin_reg.fit(X_dev[:, 0].reshape(-1, 1), y_dev)
lin_pred = lin_reg.predict(X_test)

Because the model was fit on a single feature but X_test has shape (100, 126), the predict call raises a dimension mismatch error.

How should I fix this? I'm trying to evaluate how the model performs when using a single feature.

Thank you.

Post the rest of your code. – John R
Also, there is undoubtedly an implementation of this already built for you in scikit-learn or some other library. Find it and save yourself the hassle. – John R
Unfortunately, this is basically the entire code that I have at the moment. The rest is the usual import statements for NumPy and scikit-learn, plus the np.load calls for the data. I'll take your advice, though, and look for an implementation. – Sean
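For reference, scikit-learn does ship this as SequentialFeatureSelector (since version 0.24); note that it scores candidate features by cross-validation on the training data rather than on a fixed test set, so it is not identical to the loop described in the question. A small sketch with synthetic data (shapes shrunk from the question's (900, 126) so the demo runs quickly):

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data; the question would use its loaded X_dev/y_dev.
rng = np.random.default_rng(0)
X_dev = rng.normal(size=(200, 10))
y_dev = X_dev[:, :3] @ np.array([1.0, 2.0, 3.0]) + 0.1 * rng.normal(size=200)

# Greedy forward selection, scored by cross-validation on the training data.
sfs = SequentialFeatureSelector(
    LinearRegression(),
    n_features_to_select=4,  # the question would use 100 here
    direction="forward",
)
sfs.fit(X_dev, y_dev)
X_dev_sel = sfs.transform(X_dev)  # keeps only the selected columns
```

Calling sfs.transform on the test array applies the same column selection there.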

1 Answer


Apply the same column selection and reshape to X_test:

lin_reg.fit(X_dev[:, 0].reshape(-1, 1), y_dev)
lin_pred = lin_reg.predict(X_test[:, 0].reshape(-1, 1))

Note that the reshape is necessary: scikit-learn estimators expect X to be 2-D with shape (n_samples, n_features), while X_dev[:, 0] is 1-D with shape (900,). Dropping the reshape,

lin_reg.fit(X_dev[:, 0], y_dev)
lin_pred = lin_reg.predict(X_test[:, 0])

raises a ValueError telling you to reshape your data.
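A quick self-contained check of the reshape-based fix, using random data in the question's shapes (variable names taken from the question, data invented for the demo):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Dummy data matching the shapes from the question.
rng = np.random.default_rng(0)
X_dev = rng.normal(size=(900, 126))
y_dev = rng.normal(size=900)
X_test = rng.normal(size=(100, 126))

lin_reg = LinearRegression()
lin_reg.fit(X_dev[:, 0].reshape(-1, 1), y_dev)           # fit on one column
lin_pred = lin_reg.predict(X_test[:, 0].reshape(-1, 1))  # same column at predict time
print(lin_pred.shape)  # (100,)
```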