Does anyone know what the default reference group is in a PySpark multinomial logistic regression? For instance, suppose we have a multiclass outcome (target) with classes A, B, C, and D.
How does Spark choose the reference category? In standard logistic regression in other software (e.g. R, SAS) you can set the reference group yourself. So if your reference is A, you get n-1 models fitted together (n being the number of classes), with the target classes modeled as A vs B, A vs C, and A vs D.
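To make concrete what I mean by a reference category, here is a minimal sketch of the baseline-category setup in plain Python. The coefficient values and the names `COEFS_VS_A` and `class_probabilities` are made up for illustration and do not come from Spark, R, or SAS.

```python
import math

# Hypothetical coefficients, for illustration only: each non-reference class
# gets (intercept, slope) for its log-odds versus the reference class "A".
COEFS_VS_A = {
    "B": (0.5, 1.0),
    "C": (-0.2, 0.3),
    "D": (0.1, -0.7),
}

def class_probabilities(x, coefs_vs_ref, ref="A"):
    """Return P(class | x) for a baseline-category logit with one feature x."""
    # The reference class has its log-odds fixed at 0, so its score is exp(0) = 1.
    scores = {ref: 1.0}
    for label, (intercept, slope) in coefs_vs_ref.items():
        scores[label] = math.exp(intercept + slope * x)
    total = sum(scores.values())
    return {label: score / total for label, score in scores.items()}

probs = class_probabilities(0.0, COEFS_VS_A)
```

With four classes there are only three coefficient sets, and every coefficient is read as a change in log-odds relative to A, which is why the choice of A matters.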
You want to control this choice because if an outcome class with few observations is set as the reference, the estimates are going to be unstable.
Here is the link to the multinomial logistic regression model in PySpark. There the outcome classes are 0, 1, 2, but it is not clear which one is the reference. I am assuming it may be 0, but I am not sure.
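One possibility I have considered: if the fitted model reports one row of coefficients per class rather than n-1 rows (an assumption that would need checking against the model's `coefficientMatrix` and `interceptVector`), then any class could be made the reference after the fact by subtracting its row from all the others. A rough sketch in plain Python, with made-up numbers and hypothetical helper names (`rebase`, `predict`):

```python
import math

def softmax(scores):
    """Turn per-class scores into probabilities."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict(coef_rows, intercepts, x):
    """Class probabilities from one coefficient row and intercept per class."""
    scores = [
        sum(c * v for c, v in zip(row, x)) + b
        for row, b in zip(coef_rows, intercepts)
    ]
    return softmax(scores)

def rebase(coef_rows, intercepts, ref_index):
    """Subtract the chosen class's row so that class becomes the baseline."""
    ref_row = coef_rows[ref_index]
    ref_b = intercepts[ref_index]
    new_rows = [[c - r for c, r in zip(row, ref_row)] for row in coef_rows]
    new_b = [b - ref_b for b in intercepts]
    return new_rows, new_b

# Made-up values standing in for a 3-class, 2-feature model.
coef_rows = [[0.2, -0.1], [0.5, 0.4], [-0.7, -0.3]]
intercepts = [0.1, -0.2, 0.1]
x = [1.5, -2.0]

rebased_rows, rebased_b = rebase(coef_rows, intercepts, ref_index=0)
before = predict(coef_rows, intercepts, x)
after = predict(rebased_rows, rebased_b, x)
```

Subtracting the same row from every class shifts all scores by the same amount, so the predicted probabilities do not change; only the interpretation of the coefficients does. Whether this is a valid way to read Spark's output is exactly what I am unsure about.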