I want to fit a (weighted) logistic regression in Python. The weights were calculated to adjust the distribution of the sample to match the population. However, the results don't change when I use the weights.
import numpy as np
import pandas as pd
import statsmodels.api as sm
The data look like this. The target variable is VISIT; the features are all other columns except WEIGHT_both (which is the weight I want to use).
df.head()
WEIGHT_both VISIT Q19_1 Q19_2 Q19_3 Q19_4 Q19_5 Q19_6 Q19_7 Q19_8 ... Q19_23 Q19_24 Q19_25 Q19_26 Q19_27 Q19_28 Q19_29 Q19_30 Q19_31 Q19_32
0 0.022320 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 ... 4.0 4.0 1.0 1.0 1.0 1.0 2.0 3.0 3.0 2.0
1 0.027502 1.0 3.0 2.0 2.0 2.0 3.0 4.0 3.0 2.0 ... 3.0 2.0 2.0 2.0 2.0 4.0 2.0 4.0 2.0 2.0
2 0.022320 1.0 2.0 3.0 1.0 4.0 3.0 3.0 3.0 2.0 ... 3.0 3.0 3.0 2.0 2.0 1.0 2.0 2.0 1.0 1.0
3 0.084499 1.0 2.0 2.0 2.0 2.0 2.0 4.0 1.0 1.0 ... 2.0 2.0 1.0 1.0 1.0 2.0 1.0 2.0 1.0 1.0
4 0.022320 1.0 3.0 4.0 3.0 3.0 3.0 2.0 3.0 3.0 ... 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
Without the weight, the model looks like this:
X = df.drop('WEIGHT_both', axis=1)
X = X.drop('VISIT', axis=1)
X = sm.add_constant(X)
w = df['WEIGHT_both']
Y = df['VISIT']
fit = sm.Logit(Y, X).fit()
fit.summary()
Dep. Variable: VISIT No. Observations: 7971
Model: Logit Df Residuals: 7938
Method: MLE Df Model: 32
Date: Sun, 05 Jul 2020 Pseudo R-squ.: 0.2485
Time: 16:41:12 Log-Likelihood: -3441.2
converged: True LL-Null: -4578.8
Covariance Type: nonrobust LLR p-value: 0.000
coef std err z P>|z| [0.025 0.975]
const 3.8098 0.131 29.126 0.000 3.553 4.066
Q19_1 -0.1116 0.063 -1.772 0.076 -0.235 0.012
Q19_2 -0.2718 0.061 -4.483 0.000 -0.391 -0.153
Q19_3 -0.2145 0.061 -3.519 0.000 -0.334 -0.095
With the sample weight the result looks like this (no change):
fit2 = sm.Logit(Y, X, sample_weight=w).fit()
# same result if I pass class_weight instead
fit2.summary()
Dep. Variable: VISIT No. Observations: 7971
Model: Logit Df Residuals: 7938
Method: MLE Df Model: 32
Date: Sun, 05 Jul 2020 Pseudo R-squ.: 0.2485
Time: 16:41:12 Log-Likelihood: -3441.2
converged: True LL-Null: -4578.8
Covariance Type: nonrobust LLR p-value: 0.000
coef std err z P>|z| [0.025 0.975]
const 3.8098 0.131 29.126 0.000 3.553 4.066
Q19_1 -0.1116 0.063 -1.772 0.076 -0.235 0.012
Q19_2 -0.2718 0.061 -4.483 0.000 -0.391 -0.153
Q19_3 -0.2145 0.061 -3.519 0.000 -0.334 -0.095
I ran the same regression in other programs (e.g. SPSS and R), so I know the weighted result should be different.
Here is an example in R.
Without weights (same result as the Python code above):
fit = glm(VISIT ~ ., data = df[-c(1)], family = "binomial")
summary(fit)
Call:
glm(formula = VISIT ~ ., family = "binomial", data = df[-c(1)])
Deviance Residuals:
Min 1Q Median 3Q Max
-3.1216 -0.6984 0.3722 0.6838 2.1083
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.80983 0.13080 29.126 < 2e-16 ***
Q19_1 -0.11158 0.06296 -1.772 0.076374 .
Q19_2 -0.27176 0.06062 -4.483 7.36e-06 ***
Q19_3 -0.21451 0.06096 -3.519 0.000434 ***
Q19_4 0.22417 0.05163 4.342 1.41e-05 ***
With weights:
fit2 = glm(VISIT ~ ., data = df[-c(1)], weights = df$WEIGHT_both, family = "binomial")
summary(fit2)
Call:
glm(formula = VISIT ~ ., family = "binomial", data = df[-c(1)],
weights = df$WEIGHT_both)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.4894 -0.3315 0.1619 0.2898 3.7878
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 4.950e-01 1.821e-01 2.718 0.006568 **
Q19_1 -6.497e-02 8.712e-02 -0.746 0.455835
Q19_2 -1.720e-02 8.707e-02 -0.198 0.843362
Q19_3 -1.114e-01 8.436e-02 -1.320 0.186743
Q19_4 1.898e-02 7.095e-02 0.268 0.789066
Any idea how to apply sampling weights in a logistic regression in Python?