I have a banking data set where I have to predict whether a customer will take a term deposit or not. There is a categorical column called job that holds each customer's job type. I am currently in the EDA process and want to work out which job category contributes the most towards a positive prediction.
I intend to do this with logistic regression (not sure if this is the right approach; suggestions for alternative methods are welcome).
Here is what I did:
I one-hot encoded the job column with k dummies (one 1/0 column per job type), and encoded the target with k-1 dummies, leaving a single column Target_yes (1 = the customer accepted the term deposit, 0 = the customer did not). A sketch of the encoding code follows the two frames below.
job_management job_technician job_entrepreneur job_blue-collar job_unknown job_retired job_admin. job_services job_self-employed job_unemployed job_housemaid job_student
0 1 0 0 0 0 0 0 0 0 0 0 0
1 0 1 0 0 0 0 0 0 0 0 0 0
2 0 0 1 0 0 0 0 0 0 0 0 0
3 0 0 0 1 0 0 0 0 0 0 0 0
4 0 0 0 0 1 0 0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ...
45206 0 1 0 0 0 0 0 0 0 0 0 0
45207 0 0 0 0 0 1 0 0 0 0 0 0
45208 0 0 0 0 0 1 0 0 0 0 0 0
45209 0 0 0 1 0 0 0 0 0 0 0 0
45210 0 0 1 0 0 0 0 0 0 0 0 0
45211 rows × 12 columns
The target column looks like this:
0 0
1 0
2 0
3 0
4 0
..
45206 1
45207 1
45208 1
45209 0
45210 0
Name: Target_yes, Length: 45211, dtype: int32
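For reference, this is roughly the encoding code (a minimal sketch; I'm assuming the raw frame is called data, with the job types in data['job'] and the yes/no target in data['y']):

import pandas as pd

# k dummies for job: one 1/0 column per job type
vari = pd.get_dummies(data['job'], prefix='job')

# k-1 dummies for the target: drop Target_no, keep only Target_yes
tgt = pd.get_dummies(data['y'], prefix='Target', drop_first=True)['Target_yes'].astype('int32')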
I fit this to an sklearn logistic regression model and got the coefficients. Unable to interpret them, I looked for an alternative and came across the statsmodels version, and did the same with its Logit function. In the example I saw online, the author had used sm.add_constant for the X variable.
import pandas as pd
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(solver='liblinear')
model.fit(vari, tgt)
model.score(vari, tgt)  # mean accuracy on the training data

# collect the coefficients and the intercept in one frame
df = pd.DataFrame(model.coef_)
df['inter'] = model.intercept_
print(df)
The model score and print(df) results are as follows:
0.8830151954170445 (model score)
print(df)
          0         1         2         3         4         5         6         7         8         9        10        11     inter
0 -0.040404 -0.289274 -0.604957 -0.748797 -0.206201  0.573717 -0.177778 -0.530802 -0.210549  0.099326 -0.539109  0.879504 -1.795323
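To make the sklearn output readable, the coefficients can be paired back with the dummy column names (a small sketch; vari and model are the objects from above):

import pandas as pd

# one coefficient per dummy column; the largest is the strongest push towards 1
coef_by_job = pd.Series(model.coef_[0], index=vari.columns).sort_values(ascending=False)
print(coef_by_job)
print('intercept:', model.intercept_[0])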
When I use sm.add_constant, I get coefficients similar to the sklearn LogisticRegression ones, but the z-scores (which I intended to use to find the job type that contributes the most towards a positive prediction) become nan.
import statsmodels.api as sm

# add an explicit intercept column, then fit
logit = sm.Logit(tgt, sm.add_constant(vari)).fit()
logit.summary2()
Results are:
E:\Programs\Anaconda\lib\site-packages\numpy\core\fromnumeric.py:2495: FutureWarning:
Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
E:\Programs\Anaconda\lib\site-packages\statsmodels\base\model.py:1286: RuntimeWarning:
invalid value encountered in sqrt
E:\Programs\Anaconda\lib\site-packages\scipy\stats\_distn_infrastructure.py:901: RuntimeWarning:
invalid value encountered in greater
E:\Programs\Anaconda\lib\site-packages\scipy\stats\_distn_infrastructure.py:901: RuntimeWarning:
invalid value encountered in less
E:\Programs\Anaconda\lib\site-packages\scipy\stats\_distn_infrastructure.py:1892: RuntimeWarning:
invalid value encountered in less_equal
Optimization terminated successfully.
Current function value: 0.352610
Iterations 13
Model: Logit Pseudo R-squared: 0.023
Dependent Variable: Target_yes AIC: 31907.6785
Date: 2019-11-18 10:17 BIC: 32012.3076
No. Observations: 45211 Log-Likelihood: -15942.
Df Model: 11 LL-Null: -16315.
Df Residuals: 45199 LLR p-value: 3.9218e-153
Converged: 1.0000 Scale: 1.0000
No. Iterations: 13.0000
Coef. Std.Err. z P>|z| [0.025 0.975]
const -1.7968 nan nan nan nan nan
job_management -0.0390 nan nan nan nan nan
job_technician -0.2882 nan nan nan nan nan
job_entrepreneur -0.6092 nan nan nan nan nan
job_blue-collar -0.7484 nan nan nan nan nan
job_unknown -0.2142 nan nan nan nan nan
job_retired 0.5766 nan nan nan nan nan
job_admin. -0.1766 nan nan nan nan nan
job_services -0.5312 nan nan nan nan nan
job_self-employed -0.2106 nan nan nan nan nan
job_unemployed 0.1011 nan nan nan nan nan
job_housemaid -0.5427 nan nan nan nan nan
job_student 0.8857 nan nan nan nan nan
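Since the standard errors are all nan, I suspect the design matrix itself; here is a quick rank check (a sketch; vari is the dummy frame from above):

import numpy as np
import statsmodels.api as sm

# with a constant plus all k job dummies, the dummy columns sum to the
# constant column; if that holds, the rank comes out one less than the
# number of columns
X = sm.add_constant(vari)
print(X.shape[1], np.linalg.matrix_rank(X.values))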
If I use the statsmodels Logit without sm.add_constant, I get coefficients that are very different from the sklearn LogisticRegression ones, but I do get values for the z-scores (which are all negative):
import statsmodels.api as sm

# same fit, but without an intercept column
logit = sm.Logit(tgt, vari).fit()
logit.summary2()
Results are:
Optimization terminated successfully.
Current function value: 0.352610
Iterations 6
Model: Logit Pseudo R-squared: 0.023
Dependent Variable: Target_yes AIC: 31907.6785
Date: 2019-11-18 10:18 BIC: 32012.3076
No. Observations: 45211 Log-Likelihood: -15942.
Df Model: 11 LL-Null: -16315.
Df Residuals: 45199 LLR p-value: 3.9218e-153
Converged: 1.0000 Scale: 1.0000
No. Iterations: 6.0000
Coef. Std.Err. z P>|z| [0.025 0.975]
job_management -1.8357 0.0299 -61.4917 0.0000 -1.8943 -1.7772
job_technician -2.0849 0.0366 -56.9885 0.0000 -2.1566 -2.0132
job_entrepreneur -2.4060 0.0941 -25.5563 0.0000 -2.5905 -2.2215
job_blue-collar -2.5452 0.0390 -65.2134 0.0000 -2.6217 -2.4687
job_unknown -2.0110 0.1826 -11.0120 0.0000 -2.3689 -1.6531
job_retired -1.2201 0.0501 -24.3534 0.0000 -1.3183 -1.1219
job_admin. -1.9734 0.0425 -46.4478 0.0000 -2.0566 -1.8901
job_services -2.3280 0.0545 -42.6871 0.0000 -2.4349 -2.2211
job_self-employed -2.0074 0.0779 -25.7739 0.0000 -2.1600 -1.8547
job_unemployed -1.6957 0.0765 -22.1538 0.0000 -1.8457 -1.5457
job_housemaid -2.3395 0.1003 -23.3270 0.0000 -2.5361 -2.1429
job_student -0.9111 0.0722 -12.6195 0.0000 -1.0526 -0.7696
Which of the two is better? Or should I use a completely different approach?
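For reference, this is how I was planning to rank the job types once the coefficients are trustworthy (a sketch; logit is a fitted statsmodels model from above):

import numpy as np

# exponentiated coefficients are odds ratios; the largest one marks the
# job type with the strongest pull towards Target_yes = 1
odds_ratios = np.exp(logit.params).sort_values(ascending=False)
print(odds_ratios)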