
I think logistic regression could be used for both regression (getting a number between 0 and 1, e.g. using logistic regression to predict a probability) and classification. My question is: once we provide the training data and target, does logistic regression automatically figure out whether we are doing regression or classification?

For example, in the example code below, logistic regression figured out that the output should be one of the 3 classes 0, 1, 2, rather than any number between 0 and 2. Just curious how logistic regression automatically decides whether it is doing a regression (continuous output) or a classification (discrete output) problem? (See also the short sketch after the code.)

http://scikit-learn.org/stable/auto_examples/linear_model/plot_iris_logistic.html

print(__doc__)


# Code source: Gaël Varoquaux
# Modified for documentation by Jaques Grobler
# License: BSD 3 clause

import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model, datasets

# import some data to play with
iris = datasets.load_iris()
X = iris.data[:, :2]  # we only take the first two features.
Y = iris.target

h = .02  # step size in the mesh

logreg = linear_model.LogisticRegression(C=1e5)

# we fit the logistic regression classifier to the data.
logreg.fit(X, Y)

# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, x_max]x[y_min, y_max].
x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
Z = logreg.predict(np.c_[xx.ravel(), yy.ravel()])

# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.figure(1, figsize=(4, 3))
plt.pcolormesh(xx, yy, Z, cmap=plt.cm.Paired)

# Plot also the training points
plt.scatter(X[:, 0], X[:, 1], c=Y, edgecolors='k', cmap=plt.cm.Paired)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')

plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.xticks(())
plt.yticks(())

plt.show()
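For what it's worth, a quick way to see what the fitted model decided (a sketch reusing logreg and X from the code above): scikit-learn treats the target passed to fit as a set of categorical labels, records them on the estimator, and predict only ever returns members of that set, while predict_proba returns a continuous probability per class.

# The distinct labels seen in Y are stored on the fitted estimator.
print(logreg.classes_)            # -> [0 1 2]

# predict() returns one of those discrete labels per sample...
print(logreg.predict(X[:3]))      # -> e.g. [0 0 0]

# ...while predict_proba() returns a continuous probability per class,
# with each row summing to 1.
print(logreg.predict_proba(X[:3]))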
I think logistic regression could be used for both regression [...] and classification - in principle yes, but when people say "logistic regression" they always refer to the classification algorithm (yes, this is weird). The regression case is a special case of generalized linear models with a logit link function. – cel
@cel, nice catch and vote up. If I want logistic regression to output a value between 0 and 1, how should I do that? Suppose 0 means a person did not purchase something and 1 means they did, and I want to predict the probability of purchase using logistic regression. The target has only the values 0 and 1 in my case, but I want to predict a float between 0 and 1 for the probability. – Lin Ma
@LinMa use logreg.predict_proba() scikit-learn.org/stable/modules/generated/… – joc
Thanks @joc, vote up. I think the sigmoid function outputs a continuous value between 0 and 1, but logistic regression in scikit-learn by default outputs either 0 or 1 for a classification problem. Just curious how scikit-learn automatically figures out and normalizes the output to either 0 or 1 - is there a threshold that scikit-learn's logistic regression uses under the hood? Thanks. – Lin Ma
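(Note for later readers, answering the threshold question above: in scikit-learn, LogisticRegression.predict picks the class with the highest predict_proba value, which in the binary case amounts to thresholding the probability at 0.5. A minimal sketch with made-up purchase data; the feature values here are purely hypothetical:)

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: one feature per person, target 1 = purchased, 0 = not.
X = np.array([[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression().fit(X, y)

proba = clf.predict_proba(X)[:, 1]     # continuous P(purchase), in (0, 1)
labels = clf.predict(X)                # discrete 0/1 labels

# predict() is equivalent to thresholding the probability at 0.5:
print(np.array_equal(labels, (proba > 0.5).astype(int)))   # -> True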

1 Answer


Logistic regression often uses a cross-entropy cost function, which models loss according to a binary error. Also, the output of logistic regression usually follows a sigmoid around the decision boundary, meaning that while the decision boundary may be linear, the output (often viewed as the probability of the point belonging to one of the two classes on either side of the boundary) transitions in a non-linear fashion. This would make a 0-to-1 regression model built this way a very particular, non-linear function. That might be desirable in certain circumstances, but is probably not desirable in general.
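To make the shape of that transition concrete (a small sketch, not part of the original answer): the logistic sigmoid maps any real score z onto (0, 1), flat near the extremes and steep around z = 0.

import numpy as np

# The logistic (sigmoid) function: squashes any real score z into (0, 1).
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-6, 6, 7)        # [-6, -4, -2, 0, 2, 4, 6]
print(np.round(sigmoid(z), 3))   # -> [0.002 0.018 0.119 0.5 0.881 0.982 0.998]
# Flat near 0 and 1, steep around z = 0: the non-linear transition
# described above.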

You can think of logistic regression as providing an amplitude that represents the probability of being in a class, or not. If you consider a binary classifier with two independent variables, you can picture a surface where the decision boundary is the topological line along which the probability is 0.5. Where the classifier is certain of the class, the surface is either on the plateau (probability = 1) or in the low-lying region (probability = 0). The transition from low-probability regions to high usually follows a sigmoid function.
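To actually see that surface with the question's own data, here is a sketch (assuming a two-class subset of iris, since the 0.5 contour is only well defined for a binary problem):

import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model, datasets

iris = datasets.load_iris()
mask = iris.target < 2                          # classes 0 and 1 only
X, y = iris.data[mask, :2], iris.target[mask]

clf = linear_model.LogisticRegression(C=1e5).fit(X, y)

xx, yy = np.meshgrid(np.linspace(X[:, 0].min() - .5, X[:, 0].max() + .5, 200),
                     np.linspace(X[:, 1].min() - .5, X[:, 1].max() + .5, 200))

# P(class 1) at every point of the plane: the probability "surface".
P = clf.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1].reshape(xx.shape)

plt.contourf(xx, yy, P, levels=20, cmap=plt.cm.RdBu)   # plateau vs. low region
plt.contour(xx, yy, P, levels=[0.5], colors='k')       # the 0.5 decision line
plt.scatter(X[:, 0], X[:, 1], c=y, edgecolors='k', cmap=plt.cm.Paired)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.show()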

You might look at Andrew Ng's Coursera course, which has a set of classes on logistic regression. This is the first of the classes. I have a GitHub repo containing the R version of that class's output, here, which you might find helpful for understanding logistic regression better.