
Not sure if this is a great place for this question, but I was told CrossValidated was not. So, all these questions refer to sklearn, but if you have insights into logistic regression in general, I'd love to hear them as well.

1) Does the data have to be standardized (mean 0, stdev 1)?
2) In sklearn, how do I specify what kind of regularization I want (L1 vs L2)? Note that this is different from penalty; penalty refers to classification error, not a penalty on coefficients.
3) How can I use it to also do variable selection? I.e., analogously to lasso for linear regression.
4) When using regularization, how do I optimize for C, the regularization strength? Is there something built-in, or do I have to take care of this myself?

Probably an example would be most helpful, but I'd appreciate any insights on any of these questions.

This has been my starting point: http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

Thank you very much in advance!


1 Answer


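1) Strictly speaking, no: logistic regression does not compute distances between instances. Note, however, that a regularization penalty acts on the size of the coefficients, and those depend on the scale of each feature, so standardizing is usually worthwhile when you regularize.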

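2) Set the penalty parameter: penalty='l1' or penalty='l2'. Despite the name, it controls the regularization on the coefficients, not the classification loss. See the LogisticRegression page. The L2 penalty is the default.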

3) scikit-learn provides various explicit feature selection techniques, e.g. SelectKBest with a chi2 scoring function. An L1 penalty also drives some coefficients to exactly zero, which is the closest analogue to the lasso.

4) You will want to run a grid search over C (e.g. with GridSearchCV) and keep the value with the best cross-validated score.
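
Points 2 and 4 can be combined in one short sketch. This is only an illustration, not the one right way to do it: the dataset (load_breast_cancer), the grid of C values and the choice of the liblinear solver are my assumptions, and the arguments LogisticRegression accepts have changed across releases, so check the documentation for your installed version.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Example data only (an assumption); substitute your own X and y.
X, y = load_breast_cancer(return_X_y=True)

# Scaling inside the pipeline means each CV fold is standardized
# using statistics from its own training part only.
pipe = make_pipeline(
    StandardScaler(),
    # liblinear is one of the solvers that supports the L1 penalty.
    LogisticRegression(penalty="l1", solver="liblinear"),
)

# Candidate values for C (inverse regularization strength).
# The grid itself is arbitrary and chosen for illustration.
param_grid = {"logisticregression__C": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)

best_C = search.best_params_["logisticregression__C"]
coef = search.best_estimator_.named_steps["logisticregression"].coef_
n_nonzero = int((coef != 0).sum())

print("best C:", best_C)
print("non-zero coefficients:", n_nonzero, "of", coef.size)
```

Smaller C means stronger regularization, so with the L1 penalty fewer coefficients stay non-zero as C shrinks. If you would rather not set up the grid search yourself, LogisticRegressionCV is a built-in estimator that cross-validates over C.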

For more detail on all these questions, I suggest going through some of the Examples, e.g. this one and this one.
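
For point 3, explicit feature selection with SelectKBest and chi2 might look roughly like the following. Again a sketch under assumptions: the dataset and k=10 are picked arbitrarily for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Example data only (an assumption); substitute your own X and y.
X, y = load_breast_cancer(return_X_y=True)

clf = make_pipeline(
    # chi2 requires non-negative feature values, so selection runs
    # on the raw measurements, before any standardization.
    SelectKBest(chi2, k=10),  # k=10 is an arbitrary choice
    StandardScaler(),
    LogisticRegression(),
)
clf.fit(X, y)

# Column indices of the features that survived selection.
selected = clf.named_steps["selectkbest"].get_support(indices=True)
print("selected feature indices:", selected)
```

Unlike the L1 penalty, this ranks each feature on its own before the classifier is fit, so it does not account for how features interact inside the model.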