5
votes

I am a little new to this. I am using a simple LogisticRegression classifier in Python's scikit-learn. I have 4 features. My code is:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42)
classifier = LogisticRegression(random_state=0, C=100)
classifier.fit(X_train, y_train)
coef = classifier.coef_[0]
print(coef)

[-1.07091645 -0.07848958  0.66913624  0.43500273]
  • I want to know what the coef array signifies.
  • Can we use these coef * features to rank?
  • Does this mean that the last two features are the most important for classifying results?
2
Take the absolute values to rank, not the values given as-is. – Vivek Kumar
I edited the question. What I meant to say is: from this array, can we derive c1*f1 + c2*f2 + c3*f3 + c4*f4 = some value, and later rank using this value? – Naufal Khalid
Isn't the same thing done by classifier.predict()? – Vivek Kumar
I don't know exactly. – Naufal Khalid
"can we use these coef * features to rank": does this mean rank the output or rank the features? – amanbirs

2 Answers

10
votes

I have answered your questions below, but based on your questions and the comments it looks like you're still learning about logistic regression. I can recommend Advanced Data Analysis (http://www.stat.cmu.edu/~cshalizi/ADAfaEPoV/), which has a great chapter on logistic regression, as well as the Elements of Statistical Learning or An Introduction to Statistical Learning textbooks to dive into the topic.

I want to know what the coef array signifies.

The coefficient array is the list of coefficient values. The values are ordered by the order of columns in your X_train dataset. i.e. -1.07091645 is the coefficient value for the first column in X_train, -0.07848958 is the coefficient value for the second column and so on.
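
A quick way to see which coefficient belongs to which column (a sketch; the feature names are placeholders, since the question doesn't list them):

feature_names = ['f1', 'f2', 'f3', 'f4']  # placeholder names for the 4 columns of X_train
for name, c in zip(feature_names, classifier.coef_[0]):
    print(name, c)                        # prints each column name next to its coefficient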

So, the equation from your comment will become:

-1.07091645*f1 + -0.07848958*f2 + 0.66913624*f3 + 0.43500273*f4
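
Note that the full linear predictor also includes the intercept, classifier.intercept_. scikit-learn exposes this sum as classifier.decision_function, so you can verify the arithmetic directly (a sketch, assuming the fitted classifier and X_test from the question):

import numpy as np

# linear combination of features and coefficients, plus the intercept
manual = X_test @ classifier.coef_[0] + classifier.intercept_[0]
# should match what scikit-learn computes internally
assert np.allclose(manual, classifier.decision_function(X_test))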

can we use these coef * features to rank?

I'm guessing you're trying to rank the importance of features, correct me if I misunderstood your question and I'll edit the post accordingly.

First, it is important to make sure that the variables you are using are comparable. For example, let's suppose that the first two variables in your dataset are age (in years) and income (in dollars).

This means that a one-year increase in age decreases the log-odds of the outcome by 1.07091645, while a one-dollar increase in income decreases it by 0.07848958 (equivalently, they multiply the odds by exp(-1.07091645) ≈ 0.34 and exp(-0.07848958) ≈ 0.92, respectively). The effect of a one-year increase is considerably larger than that of a one-dollar increase, but the unit increase for age (one year) cannot readily be compared to a unit increase for income (one dollar).

So in this case, is age more important than income? It's hard to say.

One common way to get around this is to scale each variable to the same range; at least then you're comparing similar step changes. However, this can make the coefficient values harder to interpret, since you're no longer sure what a one-unit change in a scaled variable corresponds to.
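
For example, with scikit-learn's StandardScaler (a sketch; standardization is one of several reasonable scaling choices):

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()                       # zero mean, unit variance per feature
X_train_scaled = scaler.fit_transform(X_train)  # fit only on the training split
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics
classifier.fit(X_train_scaled, y_train)         # coefficients now measure effect per standard deviation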

Does this mean that the last two features are most important in classifying results?

No. As @Vivek Kumar points out in his comment, you should look at the absolute values. So in this case, if you feel the variables are comparable, the order of importance is 1, 3, 4, 2.

The logic is that even though the first variable has a negative coefficient, the effect of changing that variable while keeping all other variables constant is greater than the effect of changing variable 2, 3 or 4.
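
A quick way to compute that ordering with NumPy (a sketch using the coefficients printed in the question):

import numpy as np

coef = np.array([-1.07091645, -0.07848958, 0.66913624, 0.43500273])
order = np.argsort(-np.abs(coef))  # indices sorted by absolute magnitude, descending
print(order + 1)                   # [1 3 4 2], i.e. feature 1 is the most important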

5
votes

When you're doing simple logistic regression, you are trying to decide if Y is true/false, 1/0, yes/no … etc. Right?

You have these features X that presumably help you decide. The math behind basic logistic regression uses a sigmoid function (aka the logistic function), which in NumPy/Python looks like:

y = 1 / (1 + np.exp(-x))

The x in this case is the linear combination of your features and coef:

coef[0] + coef[1] * feature[0] + coef[2] * feature[1]  # etc., where coef[0] plays the role of the intercept

As this sum increases, the logistic function approaches 1; as it decreases, it approaches 0 asymptotically.

When you plug your coefficients and features into the logistic function, it will spit out a number which is the probability that your sample is true. How accurate it is depends on how well you modeled and fit the data. The goal of logistic regression is to find the coefficients that fit your data best and minimize error. Because the logistic function outputs a probability, you can use it to rank samples from least likely to most likely.

If you are using NumPy you can take a sample X and your coefficients and plug them into the logistic equation with:

import numpy as np
X = np.array([1, .2, .1, 1.5])  # one sample from your data set
c = np.array([.5, .1, -.7, .2]) # the coefficients that (hopefully) minimize error
z = X @ c                       # dot product - the linear combination

y = 1 / (1 + np.exp(-z))        # logistic function

y will be the probability that your model thinks this sample X is true.
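
For reference, this is essentially what scikit-learn's predict_proba computes for a binary problem (a sketch, assuming the fitted classifier and X_test from the original question and continuing from the snippet above; note that classifier.intercept_ must be added to the linear combination):

z = X_test @ classifier.coef_[0] + classifier.intercept_[0]
p = 1 / (1 + np.exp(-z))  # probability of the positive class
# should match the second column of predict_proba
assert np.allclose(p, classifier.predict_proba(X_test)[:, 1])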