Reference group in PySpark multinomial regression

Question

Does anyone know what the default reference group is in a PySpark multinomial logistic regression? For instance, suppose we have a multiclass outcome/target with classes A, B, C, and D.

How does Spark choose the reference category? In standard logistic regression in other software (e.g. R, SAS) you can set the reference group yourself. So if your reference is A and there are n classes, you get n-1 models fitted together, with the target classes modeled as A vs B, A vs C, and A vs D.

You want to control this choice because if an outcome class with only a few observations is set as the reference, the estimates are going to be unstable.

Here is the link to the multinomial logistic regression model in PySpark. There the outcome classes are 0, 1, 2, but it is not clear what the reference is. I am assuming it may be zero, but I am not sure of that.

andrew andrew · Accepted Answer · 2018-09-26T20:48:55

I believe that by default it does not use a reference group. This is why, if you run the snippet from your link, you find non-zero values for all intercepts.
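One way to check this yourself is to look for a class whose coefficients and intercept are all zero, which is what a reference (pivot) class would look like. The sketch below is only an illustration: `find_reference_classes` is a hypothetical helper, and the numbers are made up rather than taken from any real fit.

```python
def find_reference_classes(coef_rows, intercepts, tol=1e-12):
    """Return the indices of classes whose parameters are all (near) zero.

    coef_rows  -- one list of feature coefficients per class
    intercepts -- one intercept per class
    An empty result means no class is acting as a reference group.
    """
    reference = []
    for k, (row, b) in enumerate(zip(coef_rows, intercepts)):
        if abs(b) <= tol and all(abs(w) <= tol for w in row):
            reference.append(k)
    return reference


# Made-up parameters for 3 classes and 2 features (not real Spark output).
softmax_style = find_reference_classes(
    [[0.4, -1.2], [-0.9, 0.3], [0.5, 0.9]], [0.2, -0.5, 0.3]
)
pivot_style = find_reference_classes(
    [[0.0, 0.0], [-1.3, 1.5], [0.1, 2.1]], [0.0, -0.7, 0.1]
)
print(softmax_style)  # no all-zero class
print(pivot_style)    # class 0 is all zero
```

With a fitted PySpark model, the inputs would come from `model.coefficientMatrix.toArray().tolist()` and `model.interceptVector.toArray().tolist()`.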

From the Scala source: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/optim/aggregator/LogisticAggregator.scala

Note that there is a difference between multinomial (softmax) and binary loss. The binary case uses one outcome class as a "pivot" and regresses the other class against the pivot. In the multinomial case, the softmax loss function is used to model each class probability independently. Using softmax loss produces K sets of coefficients, while using a pivot class produces K - 1 sets of coefficients (a single coefficient vector in the binary case). In the binary case, we can say that the coefficients are shared between the positive and negative classes...

It goes on to explain that the coefficients are not generally identifiable (which is why one would pick a reference label), but that they do become identifiable when regularization is applied.
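To illustrate the identifiability point: subtracting one class's coefficients and intercept from every class leaves the softmax probabilities unchanged, so K sets of coefficients can be re-expressed after the fact as contrasts against any reference class you choose. The following is a minimal sketch in plain Python, with made-up parameter values and hypothetical helper names (`softmax_probs`, `rebase_on_reference`); it is not Spark's implementation. With regularization, the rebased values will generally not match what a model fitted directly with a pivot class would give.

```python
import math


def softmax_probs(coefs, intercepts, x):
    """Class probabilities under a softmax model for one feature vector x."""
    scores = [b + sum(w * xi for w, xi in zip(row, x))
              for row, b in zip(coefs, intercepts)]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def rebase_on_reference(coefs, intercepts, ref):
    """Re-express parameters as contrasts against class `ref`.

    After rebasing, class `ref` has all-zero parameters and every other
    class describes its log-odds relative to `ref`.
    """
    ref_row, ref_b = coefs[ref], intercepts[ref]
    new_coefs = [[w - r for w, r in zip(row, ref_row)] for row in coefs]
    new_intercepts = [b - ref_b for b in intercepts]
    return new_coefs, new_intercepts


# Made-up parameters for 3 classes and 2 features (assumed, not real output).
coefs = [[0.4, -1.2], [-0.9, 0.3], [0.5, 0.9]]
intercepts = [0.2, -0.5, 0.3]
x = [1.5, -0.7]

ref_coefs, ref_intercepts = rebase_on_reference(coefs, intercepts, ref=0)
p_before = softmax_probs(coefs, intercepts, x)
p_after = softmax_probs(ref_coefs, ref_intercepts, x)

# The two parameterizations give the same predicted probabilities.
print(max(abs(a - b) for a, b in zip(p_before, p_after)))
```

So even though Spark does not let you pick a reference class up front, you can choose one afterwards (for example, the class with the most observations) and read the rebased coefficients as log-odds relative to it.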
