I am trying to fit a multinomial logistic regression and then predict the outcome for a set of samples.

$$ \Gamma(z) = \int_0^\infty t^{z-1} e^{-t}\,dt. $$

# RZS_TC is my dataframe
RZS_TC.loc[RZS_TC['Mean_Treecover'] <= 50, 'Mean_Treecover'] = 0
RZS_TC.loc[RZS_TC['Mean_Treecover'] > 50, 'Mean_Treecover'] = 1
RZS_TC[['MAP', 'Sr', 'delTC', 'Mean_Treecover']].head()

[Output]:
                 MAP        Sr       delTC  Mean_Treecover
302993741   2159.297363 452.975647  2.666672    1.0
217364332   3242.351807 65.615341   8.000000    1.0
390863334   1617.215454 493.124054  5.666666    0.0
446559668   1095.183105 498.373383  -8.000000   0.0
246078364   2804.615234 98.981110   -4.000000   1.0
1000000 rows × 7 columns

# Fitting a logistic regression
from statsmodels.formula.api import mnlogit
model = mnlogit("Mean_Treecover ~ MAP + Sr + delTC", RZS_TC).fit()

print(model.summary2())
[Output]:
                          Results: MNLogit
====================================================================
Model:                MNLogit          Pseudo R-squared: 0.364      
Dependent Variable:   Mean_Treecover   AIC:              831092.4595
Date:                 2021-04-02 13:51 BIC:              831139.7215
No. Observations:     1000000          Log-Likelihood:   -4.1554e+05
Df Model:             3                LL-Null:          -6.5347e+05
Df Residuals:         999996           LLR p-value:      0.0000     
Converged:            1.0000           Scale:            1.0000     
No. Iterations:       7.0000                                        
--------------------------------------------------------------------
Mean_Treecover = 0  Coef.  Std.Err.     t     P>|t|   [0.025  0.975]
--------------------------------------------------------------------
         Intercept -5.2200   0.0119 -438.4468 0.0000 -5.2434 -5.1967
               MAP  0.0023   0.0000  491.0859 0.0000  0.0023  0.0023
                Sr  0.0016   0.0000   90.6805 0.0000  0.0015  0.0016
             delTC -0.0093   0.0002  -39.9022 0.0000 -0.0098 -0.0089

However, whenever I try to predict using the model.predict() function, I get the following error.

prediction = model.predict(np.array(RZS_TC[['MAP']+['Sr']+['delTC']]))
[Output]: ERROR! Session/line number was not unique in database. History logging moved to new session 2627

Does anyone know how to troubleshoot this? Is there something that I might be doing wrong?

1 Answer

The model adds an intercept, so the input you pass to predict needs to account for it. Using some example data:

from statsmodels.formula.api import mnlogit
import pandas as pd
import numpy as np
RZS_TC = pd.DataFrame(np.random.uniform(0, 1, (20, 4)),
                      columns=['MAP', 'Sr', 'delTC', 'Mean_Treecover'])

RZS_TC['Mean_Treecover'] = round(RZS_TC['Mean_Treecover'])

model = mnlogit("Mean_Treecover ~ MAP + Sr + delTC", RZS_TC).fit()

You can inspect the design matrix the model was fitted on. Note that it has four columns, the first being the constant:

model.model.exog[:5,]
Out[16]: 
array([[1.        , 0.33914763, 0.79358056, 0.3103758 ],
       [1.        , 0.45915785, 0.94991271, 0.27203524],
       [1.        , 0.55527662, 0.15122108, 0.80675951],
       [1.        , 0.18493681, 0.89854583, 0.66760684],
       [1.        , 0.38300074, 0.6945397 , 0.28128137]])

This is the same as what you get by adding a constant to the predictors:

import statsmodels.api as sm
sm.add_constant(RZS_TC[['MAP','Sr','delTC']])

    const       MAP        Sr     delTC
0     1.0  0.339148  0.793581  0.310376
1     1.0  0.459158  0.949913  0.272035
2     1.0  0.555277  0.151221  0.806760
3     1.0  0.184937  0.898546  0.667607
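
To see why the constant column matters, here is a minimal sketch that computes the probabilities by hand with NumPy. It assumes that the rounded coefficients printed in the question's summary describe the log-odds of class 1 relative to class 0, and it uses the five rows shown in the question. The names `X`, `X_design`, `beta`, `p_class1`, and `pred_class` are illustrative only.

```python
import numpy as np

# First five rows from the question's output (columns: MAP, Sr, delTC).
X = np.array([
    [2159.297363, 452.975647,  2.666672],
    [3242.351807,  65.615341,  8.000000],
    [1617.215454, 493.124054,  5.666666],
    [1095.183105, 498.373383, -8.000000],
    [2804.615234,  98.981110, -4.000000],
])

# Rounded coefficients from the summary, in the order Intercept, MAP, Sr, delTC.
# Assumption: they give the log-odds of class 1 versus class 0.
beta = np.array([-5.2200, 0.0023, 0.0016, -0.0093])

# Prepend the column of ones that the formula interface adds for the intercept.
X_design = np.column_stack([np.ones(len(X)), X])

# The logistic transform of the linear predictor gives P(class 1).
p_class1 = 1.0 / (1.0 + np.exp(-(X_design @ beta)))

# Threshold at 0.5 to obtain a class label per row.
pred_class = (p_class1 > 0.5).astype(int)
```

Without the leading column of ones, `X @ beta` fails with a shape mismatch (three columns against four coefficients), which is the kind of problem you run into when passing a bare array of predictors. With these rounded coefficients, the thresholded labels agree with the Mean_Treecover values shown for those five rows.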

If you have a DataFrame with the same column names as in the formula, it will just be:

prediction = model.predict(RZS_TC[['MAP','Sr','delTC']])
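
As a rough sketch of what that call returns, the example below fits the same formula on synthetic data and converts the predicted probabilities to class labels. The data-generating step and the names `df`, `probs`, and `pred_class` are assumptions made for illustration, not part of the original data. `predict` gives one probability column per class, so taking the arg max per row yields the predicted category.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import mnlogit

rng = np.random.default_rng(0)
n = 500

# Synthetic predictors with the same column names as in the question.
df = pd.DataFrame(rng.uniform(0, 1, (n, 3)), columns=["MAP", "Sr", "delTC"])

# Hypothetical binary outcome whose probability depends on MAP only.
log_odds = -1.0 + 2.0 * df["MAP"].to_numpy()
df["Mean_Treecover"] = (rng.uniform(size=n) < 1 / (1 + np.exp(-log_odds))).astype(int)

# disp=0 suppresses the optimizer's convergence message.
model = mnlogit("Mean_Treecover ~ MAP + Sr + delTC", df).fit(disp=0)

# Passing a DataFrame lets the formula machinery build the design matrix,
# including the intercept column, from the named columns.
probs = np.asarray(model.predict(df[["MAP", "Sr", "delTC"]]))

# One column per class; the most probable class is the prediction.
pred_class = probs.argmax(axis=1)
```

Each row of `probs` sums to 1, and with two classes the second column is the estimated probability of class 1.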

Or if you just need the fitted values, do:

model.fittedvalues