4 votes

Problem: The default implementations (no custom parameters set) of the logistic regression model in pyspark and scikit-learn appear to yield different results given their default parameter values.

I am trying to replicate a result from logistic regression (no custom parameters set) performed with pyspark (see: https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression) using the logistic regression model from scikit-learn (see: http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).

It appears to me that the two implementations (pyspark and scikit-learn) do not share the same parameters, so I can't simply match the parameters in scikit-learn to those in pyspark. Is there any way to match both models in their default configuration?

Parameters of the scikit-learn model (defaults):

```
LogisticRegression(
    C=1.0,
    class_weight=None,
    dual=False,
    fit_intercept=True,
    intercept_scaling=1,
    max_iter=100,
    multi_class='ovr',
    n_jobs=1,
    penalty='l2',
    random_state=None,
    solver='liblinear',
    tol=0.0001,
    verbose=0,
    warm_start=False)
```

Parameters of the pyspark model (defaults):

```
LogisticRegression(self,
    featuresCol="features",
    labelCol="label",
    predictionCol="prediction",
    maxIter=100,
    regParam=0.0,
    elasticNetParam=0.0,
    tol=1e-6,
    fitIntercept=True,
    threshold=0.5,
    thresholds=None,
    probabilityCol="probability",
    rawPredictionCol="rawPrediction",
    standardization=True,
    weightCol=None,
    aggregationDepth=2,
    family="auto")
```

Thank you very much!

2
Can you point out the unmatched parameters between the two classes? They seem to match, though they have different parameter names. – YLJ
For example, the scikit model has a parameter called "penalty" which defaults to "l2". However, I can't find the same parameter in the pyspark model implementation. Another example would be the parameter "aggregationDepth" in the pyspark model; it's missing in scikit's implementation. – AaronDT
@frankyjuang please see my updated question, where I included a list of the parameters of each model. – AaronDT
For penalty in scikit, set elasticNetParam in pyspark to match the setting. And aggregationDepth doesn't usually make a difference to the result. – YLJ

2 Answers

5 votes

pyspark's LR uses ElasticNet regularization, which is a weighted sum of L1 and L2 terms; the weight is elasticNetParam. So with elasticNetParam=0 you get pure L2 regularization, and regParam is the L2 regularization coefficient; with elasticNetParam=1 you get pure L1 regularization, and regParam is the L1 regularization coefficient. C in sklearn's LogisticRegression is the inverse of regParam, i.e. regParam = 1/C.

Also, the default training methods are different; you may need to set solver='lbfgs' in sklearn's LogisticRegression to make the optimization more similar to Spark's. It only works with L2 though.
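For instance, a minimal sketch of that mapping, assuming a hypothetical regParam of 0.01 on the Spark side (with the default regParam=0.0 there is no regularization at all, which sklearn can only approximate with a very large C):

```python
from pyspark.ml.classification import LogisticRegression as SparkLR
from sklearn.linear_model import LogisticRegression as SklearnLR

reg_param = 0.01  # hypothetical value; Spark's default is 0.0 (no regularization)

# Spark side: pure L2 penalty (elasticNetParam=0.0), strength set by regParam
spark_lr = SparkLR(regParam=reg_param, elasticNetParam=0.0,
                   maxIter=100, tol=1e-6, fitIntercept=True)

# sklearn side: same L2 penalty, C is the inverse of regParam, lbfgs solver
sklearn_lr = SklearnLR(penalty='l2', C=1.0 / reg_param, solver='lbfgs',
                       max_iter=100, tol=1e-6, fit_intercept=True)
```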

If you need ElasticNet regularization (i.e. 0 < elasticNetParam < 1), then sklearn implements it in SGDClassifier: use loss='log' for logistic regression and penalty='elasticnet'; alpha would be similar to regParam (and you don't have to invert it, unlike C), and l1_ratio would be elasticNetParam.
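A hedged sketch of what that could look like, with hypothetical values standing in for Spark's regParam and elasticNetParam (older sklearn versions spell the logistic loss as loss='log'; recent releases use 'log_loss'):

```python
from sklearn.linear_model import SGDClassifier

reg_param = 0.01          # hypothetical Spark regParam
elastic_net_param = 0.3   # hypothetical Spark elasticNetParam

# loss='log' gives logistic regression; penalty='elasticnet' mixes L1 and L2.
# alpha plays the role of regParam, l1_ratio that of elasticNetParam.
sgd = SGDClassifier(loss='log', penalty='elasticnet',
                    alpha=reg_param, l1_ratio=elastic_net_param,
                    max_iter=100, tol=1e-6)
```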

sklearn doesn't provide threshold directly, but you can use predict_proba instead of predict and then apply the threshold yourself.
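For example (a self-contained sketch on synthetic data; the dataset and variable names are purely illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, random_state=0)
clf = LogisticRegression(solver='lbfgs').fit(X, y)

proba = clf.predict_proba(X)[:, 1]              # probability of the positive class
threshold = 0.5                                 # pyspark's default threshold
predictions = (proba >= threshold).astype(int)  # manual thresholding
```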

Disclaimer: I have zero Spark experience; the answer is based on the sklearn and Spark docs.

3 votes

By now I have figured out that, as indicated by the parameter standardization=True, pyspark standardizes the data within the model, whereas scikit-learn doesn't. Applying preprocessing.scale before fitting the scikit-learn model gave me closely matching results for both models.
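A minimal sketch of that preprocessing step, on stand-in data (X and y are placeholders for your own features and labels):

```python
from sklearn import preprocessing
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, random_state=0)  # stand-in data

# scale features to zero mean and unit variance,
# mimicking pyspark's standardization=True
X_scaled = preprocessing.scale(X)
model = LogisticRegression().fit(X_scaled, y)
```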