You can save fitted pipelines and models, but to load them back you need to know in advance which class each saved object corresponds to. For example:
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
# OneHotEncoderEstimator is the Spark 2.3/2.4 name; in Spark 3.x use OneHotEncoder instead
from pyspark.ml.feature import VectorAssembler, StringIndexer, OneHotEncoderEstimator
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator, CrossValidatorModel

df = ...  # your DataFrame

categoricalColumns = ["A", "B", "C"]
stages = []
for categoricalCol in categoricalColumns:
    # Index each categorical column, then one-hot encode the index
    stringIndexer = StringIndexer(inputCol=categoricalCol, outputCol=categoricalCol + "Index")
    encoder = OneHotEncoderEstimator(inputCols=[stringIndexer.getOutputCol()],
                                     outputCols=[categoricalCol + "classVec"])
    stages += [stringIndexer, encoder]

# Index the label column
label_stringIdx = StringIndexer(inputCol="id_imp", outputCol="label")
stages += [label_stringIdx]

# Assemble the encoded columns into a single feature vector
assemblerInputs = [c + "classVec" for c in categoricalColumns]
assembler = VectorAssembler(inputCols=assemblerInputs, outputCol="features")
stages += [assembler]

pipeline = Pipeline(stages=stages)
pipelineModel = pipeline.fit(df)
pipelineModel.save("/path")
In the previous case, I saved a fitted PipelineModel with several stages.
Now, if you want to use it again, you must load it with PipelineModel (not Pipeline, which is the unfitted estimator):
from pyspark.ml import PipelineModel

pipelineModel = PipelineModel.load("/path")
df = pipelineModel.transform(df)
You can do the same for other estimators, for example a CrossValidator (here lr, paramGrid, and evaluator are an estimator, a parameter grid, and an evaluator you have already defined):
cv = CrossValidator(estimator=lr, estimatorParamMaps=paramGrid, evaluator=evaluator, numFolds=2)
(trainingData, testData) = df.randomSplit([0.7, 0.3], seed=100)
cvModel = cv.fit(trainingData)
cvModel.save("/path")
And to load it back, use the fitted-model class CrossValidatorModel:
cvM = CrossValidatorModel.load("/path")
predictions = cvM.transform(testData)
In brief, to load a saved model you must call load on the class that corresponds to what was saved: PipelineModel for a fitted Pipeline, CrossValidatorModel for a fitted CrossValidator, and so on.