
I am putting together a framework for classification with logistic regression. Can someone help me validate it and suggest the main library functions (sklearn, for instance) to use? Here is what I came up with:

  1. Run logistic regression from sklearn for N observations and M variables (M < N)

    train set - about 80% of the total dataset
    test set - the remaining 20%

    Q: is there a function which would allow selecting the test set as an extrapolation of the train set (e.g. the last chunk of the data in its original order) rather than a random selection? (train_test_split does not seem to do this) - see the sketch after this list

    Q: is there a function which lets me run logistic regression with regularization? Is StandardScaler what I need?

  2. When Logistic Regression is complete, how do we use the results:

    do we just use a decision boundary plot and classify a new data point based on which side of the boundary it falls on?

    I can get the coefficients, but what is the formula to calculate the target? Is it a linear polynomial passed through the sigmoid function? Is this the way to go?

    Is there a function to calculate the probability of our decision being a correct one (yes or no)? I can get the error using the score method (as with KNeighborsClassifier). There is also the predict_proba method, but I am not sure how to interpret it. There is also the confusion matrix, and a probability can be calculated from its numbers. What is the right way?

  3. Aside from Logistic Regression there are other classifiers available, such as:

    KNeighborsClassifier, LDA, and others

    What role do they play compared to Logistic Regression, and how should they be used?
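
Here is a minimal sketch of what I have in mind for step 1 (I am assuming that shuffle=False in train_test_split gives a non-random, order-preserving split, and that the regularization comes from LogisticRegression's penalty/C parameters rather than from StandardScaler - please correct me if this is wrong):

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    # toy data: N = 1000 observations, M = 5 variables (M < N)
    X = np.random.rand(1000, 5)
    y = (X[:, 0] + X[:, 1] > 1).astype(int)

    # 80/20 split without random selection: shuffle=False keeps the row order,
    # so the test set is simply the last 20% of the data
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, shuffle=False)

    # StandardScaler only standardizes the features; the L2 regularization
    # comes from LogisticRegression's penalty and C parameters
    model = make_pipeline(StandardScaler(), LogisticRegression(penalty='l2', C=1.0))
    model.fit(X_train, y_train)
    print(model.score(X_test, y_test))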

Thank you

That's a lot of questions, dude! - farhawa
@farhawa - I will take partial answers with gratitude! :) - Toly

1 Answer


Most of your questions can be addressed by reading sklearn's LogisticRegression documentation page. You have not mentioned the number of classes, so I'm going to answer your questions assuming two classes (binary).

Here are my suggestions:

Can someone help me to validate it and suggest the major library (sklearn, for instance) functions?

sklearn has a few choices when it comes to Logistic Regression. Since you mentioned you are using Logistic Regression for classification, I will limit my suggestions to the following:

  1. sklearn.linear_model.LogisticRegression
  2. sklearn.linear_model.SGDClassifier

I'm assuming you know the basics of Logistic Regression. The difference between LogisticRegression and SGDClassifier is the solver used to estimate the coefficients of the regressors. LogisticRegression estimates them using 'newton-cg', 'lbfgs', 'liblinear', or 'sag'. The default is 'liblinear' (newer sklearn versions default to 'lbfgs'), but you can change this via the solver parameter. SGDClassifier uses a stochastic gradient descent solver. For a more detailed explanation of the differences, refer to the links provided.
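
As a rough sketch (note that in recent sklearn versions the logistic loss for SGDClassifier is spelled loss='log_loss'; older versions use loss='log'):

    from sklearn.linear_model import LogisticRegression, SGDClassifier

    # batch solver: coefficients are estimated with one of
    # 'newton-cg', 'lbfgs', 'liblinear' or 'sag'
    clf_lr = LogisticRegression(solver='liblinear')

    # same logistic model, but the coefficients are estimated
    # with stochastic gradient descent
    clf_sgd = SGDClassifier(loss='log_loss')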

is there a function which lets me run logistic regression with regularization?

Both of the above suggestions use the penalty parameter to set the regularization type.
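
For example (a minimal sketch; C in LogisticRegression is the inverse of the regularization strength, while alpha in SGDClassifier is the regularization strength directly):

    from sklearn.linear_model import LogisticRegression, SGDClassifier

    # L2-regularized logistic regression; smaller C means stronger regularization
    clf_l2 = LogisticRegression(penalty='l2', C=0.1)

    # the same idea in SGDClassifier, expressed through penalty and alpha
    clf_sgd_l2 = SGDClassifier(loss='log_loss', penalty='l2', alpha=0.01)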

When Logistic Regression is complete, how do we use the results?

Once logistic regression is complete, predict_proba(X) can be used to determine the probability of each observation in X belonging to a class (where samples are stored row-wise). predict_proba(X) returns an N x 2 array whose first column is the probability of belonging to the negative class and whose second column is the probability of belonging to the positive class. For example, if you are interested only in the probability of belonging to the positive class, you would look at just the second column.

The second column of predict_proba(X) is equivalent to sigmoid(coef_ * X + intercept_), where sigmoid is the logistic function 1 / (1 + exp(-z)).
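
A small sketch to illustrate both points (the data and names here are just for illustration; for a binary problem the last line should print True):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=200, n_features=5, random_state=0)
    clf = LogisticRegression().fit(X, y)

    proba = clf.predict_proba(X)        # shape (N, 2): [P(class 0), P(class 1)]
    p_positive = proba[:, 1]            # probability of the positive class

    # manual computation: linear combination of the features passed through the sigmoid
    z = X @ clf.coef_.T + clf.intercept_            # shape (N, 1)
    manual = (1.0 / (1.0 + np.exp(-z))).ravel()

    print(np.allclose(p_positive, manual))          # True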

There is also a confusion matrix and probability can be calculated using its numbers. What is the right way?

The confusion matrix is an error metric that shows how many observations are classified correctly and incorrectly, and in what way (true positives, true negatives, false positives, false negatives). Since the output of logistic regression is a probability, you need to threshold the probabilities (e.g. at 0.5) to assign each observation to a class; once you have done that, you can build the confusion matrix. There is no single "right" way to calculate error; there are many error metrics to choose from. The first page of Damien François' error cheat sheet lists various options for error metrics used in binary classification. The one you ultimately go with depends on several factors, such as the cost of each type of error and whether the classes have the same number of training observations.
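
For example, a minimal sketch of thresholding at 0.5 and building the confusion matrix (toy data, names are illustrative):

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import confusion_matrix
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=500, n_features=5, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0)

    clf = LogisticRegression().fit(X_train, y_train)

    p_positive = clf.predict_proba(X_test)[:, 1]   # probability of the positive class
    y_pred = (p_positive >= 0.5).astype(int)       # threshold at 0.5 (same as clf.predict here)

    # rows are the true classes, columns the predicted classes:
    # [[true negatives,  false positives],
    #  [false negatives, true positives ]]
    print(confusion_matrix(y_test, y_pred))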