
I have a strange issue, already mentioned here: LinearSVC Feature Selection returns different coef_ in Python

but I cannot quite relate that case to mine.

I have an L1-regularised logistic regression that I am using for feature selection. When I simply rerun the code, the number of features selected changes. The target variable is binary (1/0). There are 709 features and 435 training observations, so there are more features than observations. The penalty C was obtained through TimeSeriesSplit CV and never changes between reruns; I verified that.
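
For context, the tuning step that produced LR_penalty looked roughly like this (a simplified sketch; the grid values are illustrative, not my exact settings):

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

# Pick C by time-series cross-validation; the refit below reuses LR_penalty.C
search = GridSearchCV(
    LogisticRegression(penalty='l1', solver='liblinear', max_iter=10000),
    param_grid={'C': [0.01, 0.1, 1, 10]},
    cv=TimeSeriesSplit(n_splits=5),
)
LR_penalty = search.fit(df_training_features, df_training_targets).best_estimator_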

Below is the code for the feature selection part.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectFromModel

X = df_training_features
y = df_training_targets

# L1-penalised logistic regression; C was tuned beforehand via TimeSeriesSplit CV
lr_l1 = LogisticRegression(C=LR_penalty.C, max_iter=10000, class_weight=None, dual=False,
                           fit_intercept=True, intercept_scaling=1, l1_ratio=None, n_jobs=None,
                           penalty='l1', random_state=None, solver='liblinear', tol=0.0001,
                           verbose=0, warm_start=False).fit(X, y)

# Keep only features whose |coef_| is at least 1e-5
model = SelectFromModel(lr_l1, threshold=1e-5, prefit=True)

feature_idx = model.get_support()          # boolean mask over the 709 features
feature_name = X.columns[feature_idx]
X_new = model.transform(X)

# Print and plot the surviving coefficients
importance = lr_l1.coef_[0]
for i, v in enumerate(importance):
    if np.abs(v) >= 1e-5:
        print('Feature: %0d, Score: %.5f' % (i, v))
sel = importance[np.abs(importance) >= 1e-5]

# plot feature importance
plt.figure(figsize=(12, 10))
plt.bar(list(feature_name), sel)
plt.xticks(fontsize=10, rotation=70)
plt.ylabel('Feature Importance', fontsize=14)
plt.show()

[Plot 1: bar chart of selected feature importances (22 features)]

[Plot 2: the same chart from another run (24 features)]

As seen above, rerunning sometimes gives me 22 selected features (first plot), sometimes 24 (second plot), or 23. I am not sure what is happening. I thought the issue was in SelectFromModel, so I explicitly set the threshold to 1e-5 (which is the default for L1 regularisation), but nothing changed.

It is always the same features that are sometimes in and sometimes out, so I checked their coefficients, thinking they might be close to that threshold. They are not: they are one or two orders of magnitude above it.
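
A rough diagnostic sketch (reusing X, y and LR_penalty.C from above) to see which features flip between two refits:

import numpy as np
from sklearn.linear_model import LogisticRegression

def l1_coefs():
    # No fixed random_state here, on purpose: mimic the unstable reruns
    return LogisticRegression(C=LR_penalty.C, penalty='l1', solver='liblinear',
                              max_iter=10000).fit(X, y).coef_[0]

coef_a, coef_b = l1_coefs(), l1_coefs()
flipped = np.flatnonzero((np.abs(coef_a) >= 1e-5) != (np.abs(coef_b) >= 1e-5))
for i in flipped:
    print(X.columns[i], coef_a[i], coef_b[i])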

Can anybody please help? I have been struggling with this for more than a day.

Comment (Sergey Bushmanov): Try fixing random_state=42.

1 Answer


You used solver='liblinear'. From the documentation:

random_state : int, RandomState instance, default=None

Used when solver == ‘sag’, ‘saga’ or ‘liblinear’ to shuffle the data. See Glossary for details.

So set a fixed value for random_state and you should converge to the same result on every rerun.
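
For example, pinning the seed in your original call (42 is an arbitrary choice):

lr_l1 = LogisticRegression(C=LR_penalty.C, penalty='l1', solver='liblinear',
                           max_iter=10000, tol=0.0001,
                           random_state=42).fit(X, y)  # fixed seed: identical coef_ on every rerun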

After a quick search, I found that liblinear minimizes the cost function with coordinate descent (source). With no fixed random_state it shuffles the order in which the data and coordinates are visited, so each run takes a slightly different path through the optimisation and, with tol=0.0001, stops at a slightly different solution. I suppose that is why a few coefficients end up on different sides of the selection threshold from run to run.
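
You can reproduce the effect on synthetic data shaped like yours (an illustrative sketch; the 435 rows, 709 columns and C=0.1 are arbitrary choices):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=435, n_features=709, n_informative=20,
                           random_state=0)

# Same model, different shuffling seeds: the count of coefficients that
# survive the 1e-5 threshold can differ by a feature or two.
counts = [
    int((np.abs(LogisticRegression(penalty='l1', C=0.1, solver='liblinear',
                                   max_iter=10000, random_state=seed)
                .fit(X, y).coef_[0]) >= 1e-5).sum())
    for seed in range(5)
]
print(counts)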