2
votes

I am new to Spark and am building a platform that supports machine learning. I am looking for a way to save models and came across the save method of the models.

Its documentation states:

Save this model to the given path.
This saves:
- human-readable (JSON) model metadata to path/metadata/
- Parquet formatted data to path/data/

I am looking for a way to load models written in any of the supported programming languages (Python, Java, Scala) using one programming language only (Java).

Is it possible to simply load a model saved from a different programming language?

1
Did you try exporting the model using PMML? – eliasah
@eliasah Hi, and thank you for the answer. According to the documentation, not all models support this kind of model export, so it's not suitable for my needs. – Anton.P
You have two options. The first one, and it is how it's actually done, is using PMML; in case the model you intend to use doesn't support PMML export, you'll need to implement it yourself. (Reminder: PMML is an XML-based standard developed for this purpose.) Another way is to export your model as an object; between Java and Scala that should be straightforward, if I'm not mistaken, since Scala is JVM-based. Otherwise, for Python, you'll need to use frameworks like Py4J. Personally, I'd go with option 1. – eliasah
Otherwise, your question is too broad. – eliasah
@eliasah As far as I know, it works. I've used this option only a few times, but it seems to work just fine. The problem is with models which cannot be saved this way (ML models and pipelines, for example). – zero323

1 Answer

0
votes

Generally speaking, the save method of MLlib models which extend Saveable generates language-agnostic output. This means a saved model can simply be loaded using any supported language. For example (code adapted from the official documentation):

Python:

from pyspark.mllib.clustering import KMeans, KMeansModel
import numpy as np

data = sc.textFile("data/mllib/kmeans_data.txt")
parsedData = data.map(lambda line: np.array([float(x) for x in line.split(' ')]))

clusters = KMeans.train(parsedData, 2, maxIterations=10,
    runs=10, initializationMode="random")

clusters.centers
## [array([ 9.1,  9.1,  9.1]), array([ 0.1,  0.1,  0.1])]

clusters.predict(np.array([0.2, 0.2, 0.2]))
## 1

clusters.save(sc, "clusters")

Scala:

import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
import org.apache.spark.mllib.linalg.Vectors

val clusters = KMeansModel.load(sc, "clusters")

clusters.clusterCenters
// Array[org.apache.spark.mllib.linalg.Vector] = Array(
//   [9.099999999999998,9.099999999999998,9.099999999999998],
//   [0.1,0.1,0.1])

clusters.predict(Vectors.dense(Array(0.2, 0.2, 0.2)))
// 1
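Since the question asks specifically about loading in Java, the same load call is available from the Java API as well. A minimal sketch (the application name and the `main` wrapper are just for illustration; it assumes the model was saved to the same `"clusters"` path as above and that Spark is on the classpath):

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.clustering.KMeansModel;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;

public class LoadClusters {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("LoadClusters");
        JavaSparkContext jsc = new JavaSparkContext(conf);

        // Load the model that was saved from Python (or Scala)
        KMeansModel clusters = KMeansModel.load(jsc.sc(), "clusters");

        // Inspect the cluster centers, as in the Scala example
        for (Vector center : clusters.clusterCenters()) {
            System.out.println(center);
        }

        // Predict the cluster for a new point
        System.out.println(clusters.predict(Vectors.dense(0.2, 0.2, 0.2)));

        jsc.stop();
    }
}
```

Java and Scala share the same JVM classes here, so `KMeansModel.load` reads exactly the same metadata/ and data/ directories regardless of which language wrote them.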