
I am trying to apply StringIndexer() to multiple columns. I am working with Scala and Spark 2.3.
This is my code:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.StringIndexer

val df1 = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("file:///c:/tmp/spark-warehouse/train.csv")

val feat = df1.columns.filterNot(_.contains("BsmtFinSF1"))

val inds = feat.map { colName =>
  val indexer1 = new StringIndexer()
    .setInputCol(colName)
    .setOutputCol(colName + "I")
    .fit(df1)

  Array(indexer1)
}

val pipeline = new Pipeline().setStages(inds.toArray)

But I get this error:

Error:(134, 50) type mismatch;

found : Array[Array[org.apache.spark.ml.feature.StringIndexerModel]]
required: Array[? <: org.apache.spark.ml.PipelineStage]

Note: Array[org.apache.spark.ml.feature.StringIndexerModel] >: ? <: org.apache.spark.ml.PipelineStage, but class Array is invariant in type T. You may wish to investigate a wildcard type such as _ >: ? <: org.apache.spark.ml.PipelineStage. (SLS 3.2.10)
val pipeline = new Pipeline().setStages(inds.toArray)

Any help would be appreciated. Thank you.

1 Answer


.setStages expects an Array[PipelineStage], but inds.toArray is an Array[Array[StringIndexerModel]], because you wrap each indexer1 in a redundant Array here: Array(indexer1). map returns a collection of the same shape whose elements are the results of applying the given function, so just return the indexer itself:

val inds = feat.map { colName =>
  new StringIndexer()
    .setInputCol(colName)
    .setOutputCol(colName + "I")
    .fit(df1)
}
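As an alternative worth noting: StringIndexer is itself a PipelineStage, so instead of fitting each indexer yourself inside map, you can pass the unfitted indexers to the Pipeline and let pipeline.fit() fit them all in one pass. A minimal sketch, assuming the same df1 and feat as in your question:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.StringIndexer

// Build unfitted StringIndexer stages, one per column.
// Each element is a StringIndexer (a PipelineStage), not an Array.
val stages = feat.map { colName =>
  new StringIndexer()
    .setInputCol(colName)
    .setOutputCol(colName + "I")
}

// Pipeline.fit fits every stage in order on df1.
val pipeline = new Pipeline().setStages(stages)
val model = pipeline.fit(df1)
val indexed = model.transform(df1)
```

Equivalently, if you keep your original map body, calling inds.flatten would collapse the Array[Array[StringIndexerModel]] into the flat Array[StringIndexerModel] that setStages accepts.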