1 vote

I have the following model that I would like to estimate using Spark ML's MultilayerPerceptronClassifier().

import org.apache.spark.ml.feature.RFormula

val formula = new RFormula()
  .setFormula("vtplus15predict ~ vhisttplus15 + vhistt + vt + vtminus15 + Time + Length + Day")
  .setFeaturesCol("features")
  .setLabelCol("label")

formula.fit(data).transform(data)

Note: the features column is a vector and the label column is a double:

root
 |-- features: vector (nullable = true)
 |-- label: double (nullable = false)

I define my MLP estimator as follows:

import org.apache.spark.ml.classification.MultilayerPerceptronClassifier

val layers = Array[Int](6, 5, 8, 1) // I suspect this is where it went wrong

val mlp = new MultilayerPerceptronClassifier()
  .setLayers(layers)
  .setBlockSize(128)
  .setSeed(1234L)
  .setMaxIter(100)

// train the model
val model = mlp.fit(train)

Unfortunately, I got the following error:

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties

Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 3.0 failed 1 times, most recent failure: Lost task 0.0 in stage 3.0 (TID 3, localhost, executor driver): java.lang.ArrayIndexOutOfBoundsException: 11
    at org.apache.spark.ml.classification.LabelConverter$.encodeLabeledPoint(MultilayerPerceptronClassifier.scala:121)
    at org.apache.spark.ml.classification.MultilayerPerceptronClassifier$$anonfun$3.apply(MultilayerPerceptronClassifier.scala:245)
    at org.apache.spark.ml.classification.MultilayerPerceptronClassifier$$anonfun$3.apply(MultilayerPerceptronClassifier.scala:245)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:363)
    at scala.collection.Iterator$GroupedIterator.takeDestructively(Iterator.scala:935)
    at scala.collection.Iterator$GroupedIterator.go(Iterator.scala:950)
    ...


3 Answers

4 votes

org.apache.spark.ml.classification.LabelConverter$.encodeLabeledPoint(MultilayerPerceptronClassifier.scala:121)

This tells us that an array index is out of bounds in MultilayerPerceptronClassifier.scala. Let's look at the code there:

def encodeLabeledPoint(labeledPoint: LabeledPoint, labelCount: Int): (Vector, Vector) = {
  val output = Array.fill(labelCount)(0.0) // one slot per class
  output(labeledPoint.label.toInt) = 1.0   // throws if label >= labelCount
  (labeledPoint.features, Vectors.dense(output))
}

It performs a one-hot encoding of the labels in the dataset. The ArrayIndexOutOfBoundsException occurs since the output array is too short.
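To see the failure in isolation, here is the same index arithmetic in plain Scala, using the values from the question: one output node (so labelCount = 1) and a label of 11, which is the index the exception reports:

// Plain-Scala reproduction of the failing index arithmetic.
val labelCount = 1                       // the last element of Array(6, 5, 8, 1)
val output = Array.fill(labelCount)(0.0) // the one-hot array has a single slot
output(11) = 1.0                         // java.lang.ArrayIndexOutOfBoundsException: 11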

Going back through the code, you can find that labelCount is the same as the number of output nodes in the layers array. In other words, the number of output nodes must equal the number of classes. The documentation for MLP contains the following line:

The number of nodes N in the output layer corresponds to the number of classes.

The solution is therefore to either:

  1. Change the number of nodes in the final layer of the network (output nodes), as sketched below.

  2. Reconstruct the data to have the same number of classes as your network output nodes.

Note: The final output layer should always have 2 or more nodes, never 1, since there should be one node per class, and a problem with a single class does not make sense.
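For option 1, here is a minimal sketch of the fix. It assumes, as in the question, 6 input features and labels that are 0-based class indices (the exception reports index 11, so the data contains at least 12 classes); data and train are the asker's DataFrames:

// Size the output layer from the data instead of hard-coding it.
// Assumes labels are class indices 0, 1, ..., k-1.
val numClasses = data.select("label").distinct().count().toInt

val layers = Array[Int](6, 5, 8, numClasses) // 6 inputs, one output node per class

val mlp = new MultilayerPerceptronClassifier()
  .setLayers(layers)
  .setBlockSize(128)
  .setSeed(1234L)
  .setMaxIter(100)

val model = mlp.fit(train)

If the labels are not already contiguous indices starting at 0, re-index them first (for example with StringIndexer) before counting.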

0 votes

Rearrange your dataset: as the error shows, you have fewer array slots than values in your feature set, or your dataset contains nulls, which prompted the error. A sketch of the null cleanup is below. I came across this type of error while working on my MLP project; hope my answer helps you. Thanks for reaching out.
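If nulls are indeed the problem, a minimal sketch of the cleanup, using the column names from the question's formula (na.drop is the standard DataFrame method for this):

// Drop rows that have nulls in any column the formula uses.
val clean = data.na.drop(Seq(
  "vtplus15predict", "vhisttplus15", "vhistt", "vt",
  "vtminus15", "Time", "Length", "Day"))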

-2 votes

The solution is to first find a local optimum that escapes the ArrayIndexOutOfBoundsException, then use brute-force search to find the global optimum. Shaido suggests finding n:

For example, val layers = Array[Int](6, 5, 8, n). This assumes the length of the feature vectors is 6. – Shaido

So make n a large integer (n = 100), then manually brute-force search down to a good solution (n = 50, then try n = 32 - error, n = 35 - perfect). This search could also be scripted, as sketched below.

Credit to Shaido.
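For what it's worth, a sketch of automating that search, assuming train is the training DataFrame from the question; note that counting the distinct labels, as in the accepted answer, reaches the right value directly and is far cheaper:

import scala.util.Try

// Probe output-layer sizes until fit() stops throwing.
def fits(n: Int): Boolean = Try {
  new MultilayerPerceptronClassifier()
    .setLayers(Array[Int](6, 5, 8, n))
    .setMaxIter(1) // one iteration is enough to surface the encoding error
    .fit(train)
}.isSuccess

val minWorking = (2 to 100).find(fits) // smallest output size that trains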