I'm trying to reproduce the following R results in Python. In this particular case the R predictive skill is lower than the Python skill, but this is usually not the case in my experience (hence the reason for wanting to reproduce the results in Python), so please ignore that detail here.
The aim is to predict the flower species ('versicolor' 0 or 'virginica' 1). We have 100 labelled samples, each consisting of 4 flower characteristics: sepal length, sepal width, petal length, petal width. I've split the data into training (60% of data) and test sets (40% of data). 10-fold cross-validation is applied to the training set to search for the optimal lambda (the parameter that is optimized is "C" in scikit-learn).
I'm using glmnet in R with alpha set to 1 (for the LASSO penalty), and for python, scikit-learn's LogisticRegressionCV function with the "liblinear" solver (the only solver that can be used with L1 penalisation). The scoring metrics used in the cross-validation are the same between both languages. However somehow the model results are different (the intercepts and coefficients found for each feature vary quite a bit).
R Code
library(glmnet)
library(datasets)
data(iris)
y <- as.numeric(iris[,5])
X <- iris[y!=1, 1:4]
y <- y[y!=1]-2
n_sample = NROW(X)
w = .6
X_train = X[0:(w * n_sample),] # (60, 4)
y_train = y[0:(w * n_sample)] # (60,)
X_test = X[((w * n_sample)+1):n_sample,] # (40, 4)
y_test = y[((w * n_sample)+1):n_sample] # (40,)
# set alpha=1 for LASSO and alpha=0 for ridge regression
# use class for logistic regression
set.seed(0)
model_lambda <- cv.glmnet(as.matrix(X_train), as.factor(y_train),
nfolds = 10, alpha=1, family="binomial", type.measure="class")
best_s <- model_lambda$lambda.1se
pred <- as.numeric(predict(model_lambda, newx=as.matrix(X_test), type="class" , s=best_s))
# best lambda
print(best_s)
# 0.04136537
# fraction correct
print(sum(y_test==pred)/NROW(pred))
# 0.75
# model coefficients
print(coef(model_lambda, s=best_s))
#(Intercept) -14.680479
#Sepal.Length 0
#Sepal.Width 0
#Petal.Length 1.181747
#Petal.Width 4.592025
Python Code
from sklearn import datasets
from sklearn.linear_model import LogisticRegressionCV
from sklearn.preprocessing import StandardScaler
import numpy as np
iris = datasets.load_iris()
X = iris.data
y = iris.target
X = X[y != 0] # four features. Disregard one of the 3 species.
y = y[y != 0]-1 # two species: 'versicolor' (0), 'virginica' (1). Disregard one of the 3 species.
n_sample = len(X)
w = .6
X_train = X[:int(w * n_sample)] # (60, 4)
y_train = y[:int(w * n_sample)] # (60,)
X_test = X[int(w * n_sample):] # (40, 4)
y_test = y[int(w * n_sample):] # (40,)
X_train_fit = StandardScaler().fit(X_train)
X_train_transformed = X_train_fit.transform(X_train)
clf = LogisticRegressionCV(n_jobs=2, penalty='l1', solver='liblinear', cv=10, scoring = ‘accuracy’, random_state=0)
clf.fit(X_train_transformed, y_train)
print clf.score(X_train_fit.transform(X_test), y_test) # score is 0.775
print clf.intercept_ #-1.83569557
print clf.coef_ # [ 0, 0, 0.65930981, 1.17808155] (sepal length, sepal width, petal length, petal width)
print clf.C_ # optimal lambda: 0.35938137