Problem: The default implementations (no custom parameters set) of the logistic regression model in PySpark and scikit-learn seem to yield different results, given their different default parameter values.
I am trying to replicate a result from a logistic regression (no custom parameters set) performed with PySpark (see: https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression) using the logistic regression model from scikit-learn (see: http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).
It appears to me that the two implementations (in PySpark and scikit-learn) do not expose the same parameters, so I can't simply map each scikit-learn parameter onto a PySpark one. Is there any way to match both models in their default configuration? (A sketch of one possible mapping follows the parameter lists below.)
Parameters of the scikit-learn model (defaults):
```python
LogisticRegression(
    C=1.0,
    class_weight=None,
    dual=False,
    fit_intercept=True,
    intercept_scaling=1,
    max_iter=100,
    multi_class='ovr',
    n_jobs=1,
    penalty='l2',
    random_state=None,
    solver='liblinear',
    tol=0.0001,
    verbose=0,
    warm_start=False)
```
Parameters of the PySpark model (defaults):
```python
LogisticRegression(self,
    featuresCol="features",
    labelCol="label",
    predictionCol="prediction",
    maxIter=100,
    regParam=0.0,
    elasticNetParam=0.0,
    tol=1e-6,
    fitIntercept=True,
    threshold=0.5,
    thresholds=None,
    probabilityCol="probability",
    rawPredictionCol="rawPrediction",
    standardization=True,
    weightCol=None,
    aggregationDepth=2,
    family="auto")
```
Thank you very much!
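
A minimal sketch of one way to bring the two defaults closer, not a definitive mapping: PySpark's default `regParam=0.0` means no regularization at all, while scikit-learn's default applies an L2 penalty with `C=1.0`, so the main step is to effectively disable the penalty on the scikit-learn side. The very large `C` below is an arbitrary stand-in for "no penalty", and the toy data only stands in for the real training set:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy binary data standing in for the real training set.
rng = np.random.RandomState(0)
X = rng.normal(size=(100, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

sk_model = LogisticRegression(
    C=1e12,              # effectively no regularization, approximating regParam=0.0
    fit_intercept=True,  # matches fitIntercept=True
    max_iter=100,        # matches maxIter=100
    tol=1e-6,            # matches PySpark's tol=1e-6
    solver='lbfgs',      # PySpark optimizes with L-BFGS; liblinear is a different algorithm
)
sk_model.fit(X, y)
print(sk_model.coef_, sk_model.intercept_)
```

With the penalty effectively off, differences in feature standardization should also stop mattering, since without regularization the optimum is the same either way.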
Set `penalty` in scikit, and set `elasticNetParam` in pyspark to match the setting. And `aggregationDepth` doesn't usually make a difference to the result. – YLJ
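
Going in the other direction, a hedged sketch of what the comment suggests: reproducing scikit-learn's default L2 penalty (`penalty='l2'`, `C=1.0`) on the PySpark side. The mapping `regParam = 1 / (C * n)` is an assumption based on PySpark averaging the loss over the `n` training rows while scikit-learn sums it, and `standardization=False` is set because scikit-learn does not standardize features internally. The tiny DataFrame is again just a placeholder:

```python
from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.getOrCreate()

# Toy binary data standing in for the real training set.
train_df = spark.createDataFrame(
    [(Vectors.dense([0.0, 1.1]), 1.0),
     (Vectors.dense([2.0, 1.0]), 0.0),
     (Vectors.dense([2.0, 1.3]), 1.0),
     (Vectors.dense([0.0, 1.2]), 0.0)],
    ["features", "label"])

n = train_df.count()
lr = LogisticRegression(
    maxIter=100,
    tol=1e-4,                  # scikit-learn's default tolerance
    regParam=1.0 / (1.0 * n),  # assumed correspondence: regParam = 1 / (C * n), with C=1.0
    elasticNetParam=0.0,       # pure L2, matching penalty='l2'
    standardization=False,     # scikit-learn does not standardize internally
)
model = lr.fit(train_df)
print(model.coefficients, model.intercept)
```

Even with this mapping, small numerical differences may remain because the two libraries use different optimizers (L-BFGS vs. liblinear's coordinate descent).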