Recently I had to prepare some lab material for students learning machine learning with Spark/MLlib/Scala. I am familiar with machine learning but new to Spark.
One "textbook" trick of machine learning is to add higher degree terms of original features to allow non-linear model. Let say, after I load the training data from a LIBSVM file, I want to add the square of all features in addition to the original ones. My current limited knowledge yields the below implementation:
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Load training data in LIBSVM format.
val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")

// For each example, build a second RDD of squared features,
// collect it, and append it to the original feature array.
val data_add_square = data.map { s =>
  val tmp = sc.parallelize(s.features.toArray)
  val tmp_sqr = tmp.map(x => x * x)
  new LabeledPoint(s.label, Vectors.dense(s.features.toArray ++ tmp_sqr.collect))
}
This implementation feels far too "heavyweight", and I suspect it is not even correct: calling sc.parallelize inside a map over a distributed RDD should fail at runtime, since SparkContext is not serializable and nested RDDs are not supported. Can anyone shed some light on the right way to do this?
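For reference, the following is a minimal sketch of what I imagine the lighter approach looks like: the squaring is done locally on each feature array with plain Scala collection operations, so no second RDD is ever created inside the map. I am not sure this is the idiomatic MLlib way, which is exactly what I am asking about.

// Square each feature locally; everything happens inside one map,
// so only ordinary Scala arrays are used on the executors.
val data_add_square = data.map { s =>
  val arr = s.features.toArray
  new LabeledPoint(s.label, Vectors.dense(arr ++ arr.map(x => x * x)))
}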