Checking online (here and here), I see there are two ways to estimate the odds ratio in Python, but they give different results.

First way:

import scipy.stats as stats
import pandas as pd
df=pd.DataFrame({'c':['m','m','m','m','f','f','f','f'],'l':[1,1,1,0,0,0,0,1]})
ct=pd.crosstab(df.c,df.l)
oddsratio, pvalue = stats.fisher_exact(ct)

Second way:

import numpy as np
from sklearn.linear_model import LogisticRegression
df=pd.get_dummies(df,drop_first=True)  # encodes c as a single indicator column, c_m
clf = LogisticRegression()
clf.fit(df[['c_m']],df['l'])
odds_ratio=np.exp(clf.coef_)

The first approach returns an odds ratio of 9 and the second returns an odds ratio of 1.9. I am relatively new to the concept of odds ratios, and I am not sure how the Fisher test and logistic regression could be used to obtain the same value. What is the difference, and which method is the correct way to get the odds ratio in this case? I would appreciate any hint. Thanks.

1 Answer

Short answer:

In both cases, you should get the same odds ratio of 9.

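By default, the scikit-learn logistic regression model uses penalty='l2'. This regularization shrinks the coefficient and so distorts the odds ratio. If you use penalty='none', you will get the matching odds ratio. (Newer scikit-learn releases spell this option penalty=None instead of the string 'none'.)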

So change the estimator to

clf = LogisticRegression(penalty='none')

and then recompute odds_ratio.
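
To see how the penalty pulls the estimate away from 9, here is a minimal sketch that does not use scikit-learn at all. It fits the same one-predictor model with plain gradient descent, assuming the objective is the summed log-loss plus b1**2 / (2*C) with an unpenalized intercept, which is my reading of how scikit-learn's default solver treats C. The function name fit_odds_ratio and the step settings are illustrative, not part of any library.

```python
import math

# Same data as the question: x = 1 for 'm', 0 for 'f'; y is the label l.
x = [1, 1, 1, 1, 0, 0, 0, 0]
y = [1, 1, 1, 0, 0, 0, 0, 1]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_odds_ratio(C, steps=20000, lr=0.1):
    """Gradient descent on sum(log-loss) + b1**2 / (2*C).

    Assumption: only the slope b1 is penalized, the intercept b0 is not.
    Returns exp(b1), the odds ratio implied by the fitted slope.
    """
    b0, b1 = 0.0, 0.0
    for _ in range(steps):
        g0 = g1 = 0.0
        for xi, yi in zip(x, y):
            err = sigmoid(b0 + b1 * xi) - yi
            g0 += err
            g1 += err * xi
        g1 += b1 / C  # gradient of the L2 penalty on the slope
        b0 -= lr * g0
        b1 -= lr * g1
    return math.exp(b1)

or_default = fit_odds_ratio(C=1.0)  # penalty strength comparable to the default C=1
or_weak = fit_odds_ratio(C=1e6)     # penalty made negligible
print(or_default, or_weak)
```

With C=1 the estimate lands a little below 2, in line with the 1.9 reported in the question, and with a very large C it approaches 9. On such a small sample the penalty term carries a lot of weight relative to the data, which is why the shrinkage is so pronounced.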

Long Answer:

In the first case, the odds ratio is the sample odds ratio, computed directly from the contingency (cross-tabulation) table as shown below.

Contingency table for the df would be

    l   0   1
c       
f       3   1
m       1   3

odds ratio = (odds of l=0 for f) / (odds of l=0 for m)

odds of l=0 for f = P(l=0 | c=f) / P(l=1 | c=f) = (3/4) / (1/4) = 3

odds of l=0 for m = P(l=0 | c=m) / P(l=1 | c=m) = (1/4) / (3/4) = 1/3

odds ratio = ((3/4)/(1/4)) / ((1/4)/(3/4)) = 3 / (1/3) = 9
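
The same arithmetic can be checked in a few lines of plain Python. This is only a sketch of the hand calculation above; the dictionary layout and the helper name odds_of_zero are illustrative choices, not anything from pandas or scipy.

```python
# Counts from the contingency table: rows are c (f, m), columns are l (0, 1).
table = {"f": {0: 3, 1: 1}, "m": {0: 1, 1: 3}}

def odds_of_zero(row):
    """Odds of l = 0 versus l = 1 within one row of the table."""
    return row[0] / row[1]

odds_f = odds_of_zero(table["f"])  # 3 / 1
odds_m = odds_of_zero(table["m"])  # 1 / 3
odds_ratio = odds_f / odds_m

# Equivalent cross-product form: (a * d) / (b * c)
cross_product = (table["f"][0] * table["m"][1]) / (table["f"][1] * table["m"][0])
print(odds_ratio, cross_product)
```

Taking the odds of l=1 for m against the odds of l=1 for f gives the same ratio, since both the numerator and the denominator are inverted together.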

In the second case, you are estimating the odds ratio by fitting a logistic regression model. You will get odds ratio = 9 if you use penalty='none'. By default, the penalty in the LogisticRegression estimator is 'l2'.

import numpy as np
from sklearn.linear_model import LogisticRegression
df=pd.get_dummies(df,drop_first=True)
clf = LogisticRegression(penalty='none')  # newer scikit-learn versions use penalty=None
clf.fit(df[['c_m']],df['l'])
odds_ratio=np.exp(clf.coef_)

print(odds_ratio)

[[9.0004094]]

You can also get the odds ratio by another method, which gives the same value:

# Method 2:
odds_of_y_is_1_for_male = np.exp(clf.intercept_+clf.coef_*1)    # c_m = 1: odds for male
odds_of_y_is_1_for_female = np.exp(clf.intercept_+clf.coef_*0)  # c_m = 0: odds for female
odds_ratio_2 = odds_of_y_is_1_for_male/odds_of_y_is_1_for_female
print(odds_ratio_2)

[[9.0004094]]

To understand why both methods give the same result, see here.
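
In brief, the agreement follows from the model itself. The sketch below assumes the single indicator c_m (1 for m, 0 for f) used above and writes the intercept and coefficient as \beta_0 and \beta_1, which are my own symbols.

```latex
\begin{aligned}
\log\frac{P(l=1 \mid c_m)}{P(l=0 \mid c_m)} &= \beta_0 + \beta_1 c_m \\
\text{odds}_f &= e^{\beta_0}, \qquad \text{odds}_m = e^{\beta_0 + \beta_1} \\
\frac{\text{odds}_m}{\text{odds}_f} &= e^{\beta_1} \\
\text{unpenalized fit: } \text{odds}_m &= \frac{3/4}{1/4} = 3, \quad
\text{odds}_f = \frac{1/4}{3/4} = \frac{1}{3}
\;\Rightarrow\; e^{\beta_1} = \frac{3}{1/3} = 9
\end{aligned}
```

The last line relies on the fact that, with one binary predictor and no penalty, the maximum-likelihood fit reproduces the observed proportion of l=1 in each group, so exp(coef_) equals the odds ratio from the table. The small excess in 9.0004094 comes from the numerical optimizer stopping at its tolerance.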