Getting weights of features using scikit-learn Logistic Regression

Question

I am a little new to this. I am using a simple Logistic Regression classifier in Python with scikit-learn. I have 4 features. My code is:

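from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# X holds the 4 feature columns, Y the class labels
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42)
classifier = LogisticRegression(random_state=0, C=100)
classifier.fit(X_train, y_train)
coef = classifier.coef_[0]
print(coef)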

[-1.07091645 -0.07848958  0.66913624  0.43500273]

I want to know what the coef array signifies.
Can we use these coef * features to rank?
Does this mean that the last two features are most important in classifying results?

Take the absolute values to rank. Not the values given as is. — Vivek Kumar
I edited the question. What I meant to ask is: from this array, can we derive c1*f1 + c2*f2 + c3*f3 + c4*f4 = some value, and then rank using that value? — Naufal Khalid
"can we use these coef * features to rank": does this mean rank the output or rank the features? — amanbirs

amanbirs · Accepted Answer · 2017-11-16T11:33:12

I have answered your questions below, but based on your questions and the comments, it looks like you're still learning about logistic regression. I can recommend Advanced Data Analysis (http://www.stat.cmu.edu/~cshalizi/ADAfaEPoV/), which has a great chapter on logistic regression, as well as the Elements of Statistical Learning or Introduction to Statistical Learning textbooks, to dive into the topic.

I want to know what the coef array signifies.

The coefficient array is the list of fitted coefficient values, one per feature. The values follow the order of the columns in your X_train dataset. That is, -1.07091645 is the coefficient for the first column in X_train, -0.07848958 is the coefficient for the second column, and so on.

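So the equation from your comment becomes the expression below. Note that, once the intercept (classifier.intercept_) is added, this value is the log-odds of the positive class, not the predicted probability itself: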

-1.07091645*f1 + -0.07848958*f2 + 0.66913624*f3 + 0.43500273*f4
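As a rough check of how this linear score relates to the model's output, the sketch below recomputes it by hand and compares it with scikit-learn's own results. The data here is a synthetic stand-in generated with make_classification (an assumption, since the original X and Y are not shown), so the fitted coefficients will differ from the ones in the question.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the asker's data (assumption): 4 features, binary target.
X, y = make_classification(n_samples=200, n_features=4, random_state=42)

clf = LogisticRegression(random_state=0, C=100).fit(X, y)

# c1*f1 + c2*f2 + c3*f3 + c4*f4 + intercept: the log-odds of class 1.
log_odds = X @ clf.coef_[0] + clf.intercept_[0]

# The sigmoid turns log-odds into the probability of class 1.
prob = 1.0 / (1.0 + np.exp(-log_odds))

matches_decision = np.allclose(log_odds, clf.decision_function(X))
matches_proba = np.allclose(prob, clf.predict_proba(X)[:, 1])
print(matches_decision, matches_proba)
```

For a binary problem like this one, both comparisons should agree, which is why the raw weighted sum can be used to order samples by predicted probability.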

Can we use these coef * features to rank?

I'm guessing you're trying to rank the importance of features; correct me if I misunderstood your question and I'll edit the post accordingly.

First, it is important to make sure that the variables you are using are comparable. For example, let's suppose that the first two variables in your dataset are age (in years) and income (in dollars).

This means that a one-year increase in age lowers the log-odds of the positive class by 1.07091645, and a one-dollar increase in income lowers it by 0.07848958. The effect of one extra year looks considerably larger than that of one extra dollar, but a unit increase in age (one year) cannot readily be compared to a unit increase in income (one dollar).

So in this case, is age more important than income? It's hard to say.
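To make the unit problem concrete, here is a small arithmetic sketch under the age/income hypothetical above. It assumes income is re-expressed in thousands of dollars and ignores the (weak, with C=100) regularization penalty, in which case the coefficient is simply multiplied by 1000. The variable names are illustrative only.

```python
# Coefficients from the question, under the hypothetical that
# feature 1 is age (years) and feature 2 is income (dollars).
coef_age_per_year = -1.07091645
coef_income_per_dollar = -0.07848958

# Measuring income in thousands of dollars makes one unit 1000 times
# larger, so the same effect is carried by a 1000 times larger coefficient.
coef_income_per_thousand = coef_income_per_dollar * 1000

age_beats_income_in_dollars = abs(coef_age_per_year) > abs(coef_income_per_dollar)
age_beats_income_in_thousands = abs(coef_age_per_year) > abs(coef_income_per_thousand)

print(age_beats_income_in_dollars)    # True
print(age_beats_income_in_thousands)  # False
```

The model describes the same relationship in both cases, yet the ranking flips with the choice of unit, which is why raw coefficients alone cannot settle the question.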

One common way to get around this is to scale each variable to the same range before fitting. This way at least you're comparing similar step-changes. However, this can make the coefficient values harder to interpret, since a one-unit change in a scaled variable no longer corresponds to a natural unit such as one year or one dollar.
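One possible way to do the scaling is sketched below, using StandardScaler in a pipeline. This is standardization (zero mean, unit variance) rather than scaling to a fixed range; MinMaxScaler would be the range-based alternative. The data is again a synthetic stand-in (an assumption), with columns deliberately put on very different scales.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption), stretched to mimic mixed units.
X, y = make_classification(n_samples=500, n_features=4, random_state=42)
X = X * np.array([1.0, 1000.0, 0.01, 50.0])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# The scaler is fitted on the training split only, then applied before the classifier.
model = make_pipeline(StandardScaler(), LogisticRegression(random_state=0, C=100))
model.fit(X_train, y_train)

# Coefficients now refer to standardized features.
scaled_coef = model.named_steps["logisticregression"].coef_[0]
print(scaled_coef)
```

With standardized inputs, each coefficient describes the change in log-odds for a one-standard-deviation increase in that feature, which puts the features on a more comparable footing.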

Does this mean that the last two features are most important in classifying results?

No. As @Vivek Kumar points out in his comment, you should look at the absolute values. So in this case, if you feel the variables are comparable, the order of importance is features 1, 3, 4, 2.

The logic is that even though the first variable has a negative coefficient, the effect of changing that variable while keeping all other variables constant is greater than the effect of changing any one of variables 2, 3 or 4.
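The ranking by absolute value can be reproduced with a few lines of plain Python, using the coefficients printed in the question. This is only a sketch of the ordering step, and it is meaningful only if the features are on comparable scales.

```python
# Coefficients as printed in the question.
coef = [-1.07091645, -0.07848958, 0.66913624, 0.43500273]

# Sort feature indices by the magnitude of their coefficient, largest first.
ranking = sorted(range(len(coef)), key=lambda i: abs(coef[i]), reverse=True)

# Convert 0-based indices to 1-based feature numbers.
feature_order = [i + 1 for i in ranking]
print(feature_order)  # [1, 3, 4, 2]
```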
