ML Pipeline for Spark Scala

Question

I have a dataframe (df) with the following structure:

Data

label pa_age pa_gender_category
10000 32.0   male
25000 36.0   female
45000 68.0   female
15000 24.0   male

Objective

I want to build a RandomForest classifier for the column 'label', using the columns 'pa_age' and 'pa_gender_category' as features.

Process Followed

// Index the label column.
val labelIndexer = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("indexedLabel")
  .fit(df)

// Index the pa_gender_category column.
val featureTransformer = new StringIndexer()
  .setInputCol("pa_gender_category")
  .setOutputCol("pa_gender_category_label")
  .fit(df)

// Convert indexed labels back to original labels.
val labelConverter = new IndexToString()
  .setInputCol("prediction")
  .setOutputCol("predictedLabel")
  .setLabels(labelIndexer.labels)

// Train a RandomForest model.
val rf = new RandomForestClassifier()
  .setLabelCol("indexedLabel")
  .setFeaturesCol("indexedFeatures")
  .setNumTrees(10)

Expected Output from the above steps:

label pa_age pa_gender_category indexedLabel pa_gender_category_label
10000 32.0   male               1.0          1.0
25000 36.0   female             2.0          2.0
45000 68.0   female             3.0          2.0
10000 24.0   male               1.0          1.0

Now I need the data in 'label' and 'features' format:

val featureCreater = new VectorAssembler()
  .setInputCols(Array("pa_age", "pa_gender_category"))
  .setOutputCol("features")
  .fit(df)

Pipeline

val pipeline = new Pipeline()
  .setStages(Array(labelIndexer, featureTransformer, featureCreater, rf, labelConverter))

Problem

error: value fit is not a member of org.apache.spark.ml.feature.VectorAssembler
       val featureCreater = new VectorAssembler().setInputCols(Array("pa_age", "pa_gender_category_label")).setOutputCol("features").fit(df)

Basically, it is the step of converting the data into label and features format that I am having trouble with.
Is my process/pipeline correct here?

Jozef Dúc · Accepted Answer · 2017-04-27T08:07:11

The problem is here:

val featureCreater = new VectorAssembler()
  .setInputCols(Array("pa_age", "pa_gender_category"))
  .setOutputCol("features")
  .fit(df)

You cannot call fit(df) here, because VectorAssembler is a Transformer, not an Estimator, and has no fit method. You can also remove .fit(df) from the StringIndexer stages, since the pipeline fits them for you. After initializing the pipeline, call fit on the pipeline object:

val model = pipeline.fit(df)

The pipeline then runs through the stages you gave it in order, fitting each estimator and passing the transformed data on to the next stage.
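Putting the pieces together, a minimal sketch of the corrected pipeline might look like the following. It assumes the Spark 2.x-style API used in the question, a local SparkSession, and the four sample rows from the question; the object name RfPipelineSketch and the helper fitPipeline are hypothetical. It also assumes that the assembler should read the indexed column pa_gender_category_label (as in the error message) and that the classifier's features column should be the assembler's output, "features". In Spark 3.x, labels on StringIndexerModel is deprecated in favor of labelsArray.

```scala
import org.apache.spark.ml.{Pipeline, PipelineModel, PipelineStage}
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorAssembler}
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical wrapper object, used only for illustration.
object RfPipelineSketch {

  def fitPipeline(df: DataFrame): PipelineModel = {
    // Fitted up front so that labelIndexer.labels is available below.
    // A fitted StringIndexerModel is still a valid pipeline stage.
    val labelIndexer = new StringIndexer()
      .setInputCol("label")
      .setOutputCol("indexedLabel")
      .fit(df)

    // Left unfitted: the pipeline fits this stage itself.
    val featureTransformer = new StringIndexer()
      .setInputCol("pa_gender_category")
      .setOutputCol("pa_gender_category_label")

    // VectorAssembler is a Transformer, so there is no fit call.
    // Assumption: it reads the indexed gender column, not the raw string.
    val featureCreater = new VectorAssembler()
      .setInputCols(Array("pa_age", "pa_gender_category_label"))
      .setOutputCol("features")

    // Assumption: the features column matches the assembler's output.
    val rf = new RandomForestClassifier()
      .setLabelCol("indexedLabel")
      .setFeaturesCol("features")
      .setNumTrees(10)

    // Map predicted indices back to the original label strings.
    val labelConverter = new IndexToString()
      .setInputCol("prediction")
      .setOutputCol("predictedLabel")
      .setLabels(labelIndexer.labels)

    new Pipeline()
      .setStages(Array[PipelineStage](
        labelIndexer, featureTransformer, featureCreater, rf, labelConverter))
      .fit(df)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("rf-pipeline-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Sample rows copied from the question.
    val df = Seq(
      (10000, 32.0, "male"),
      (25000, 36.0, "female"),
      (45000, 68.0, "female"),
      (15000, 24.0, "male")
    ).toDF("label", "pa_age", "pa_gender_category")

    val model = fitPipeline(df)
    model.transform(df).select("label", "features", "predictedLabel").show()

    spark.stop()
  }
}
```

Only the pipeline's fit is called on the full chain; the single early fit on labelIndexer is there just to obtain the label strings for IndexToString.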

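Note that an unfitted StringIndexer has no labels property; only the fitted StringIndexerModel returned by fit does. If you remove .fit(df) from labelIndexer, labelIndexer.labels will no longer compile, so either keep that one indexer fitted (a fitted model is still a valid pipeline stage) or read the labels from the fitted pipeline model. getOutputCol returns only the output column name, so it cannot be passed to setLabels.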
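If every stage is left unfitted, one way to still recover the label strings is to read them from the fitted pipeline model and apply IndexToString afterwards. This is only a sketch under the same assumptions as the question's column names; the object name UnfittedStagesSketch and the helper predictWithLabels are hypothetical, and it relies on the label indexer being the first stage of the pipeline.

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.feature.{IndexToString, StringIndexer, StringIndexerModel, VectorAssembler}
import org.apache.spark.sql.DataFrame

// Hypothetical wrapper object, used only for illustration.
object UnfittedStagesSketch {

  def predictWithLabels(df: DataFrame): DataFrame = {
    // None of the stages is fitted here; the pipeline does it.
    val labelIndexer = new StringIndexer()
      .setInputCol("label")
      .setOutputCol("indexedLabel")

    val featureTransformer = new StringIndexer()
      .setInputCol("pa_gender_category")
      .setOutputCol("pa_gender_category_label")

    val featureCreater = new VectorAssembler()
      .setInputCols(Array("pa_age", "pa_gender_category_label"))
      .setOutputCol("features")

    val rf = new RandomForestClassifier()
      .setLabelCol("indexedLabel")
      .setFeaturesCol("features")
      .setNumTrees(10)

    val model = new Pipeline()
      .setStages(Array[PipelineStage](labelIndexer, featureTransformer, featureCreater, rf))
      .fit(df)

    // The first fitted stage corresponds to labelIndexer, so it is a StringIndexerModel.
    val labels = model.stages.head.asInstanceOf[StringIndexerModel].labels

    // Convert predicted indices back to label strings outside the pipeline.
    new IndexToString()
      .setInputCol("prediction")
      .setOutputCol("predictedLabel")
      .setLabels(labels)
      .transform(model.transform(df))
  }
}
```

The trade-off is that the label conversion is no longer part of the saved pipeline model, so it has to be reapplied wherever predictions are made.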
