1 vote

I have the following model that I would like to estimate using Spark ML's MultilayerPerceptronClassifier().

import org.apache.spark.ml.feature.RFormula

val formula = new RFormula()
  .setFormula("vtplus15predict ~ vhisttplus15 + vhistt + vt + vtminus15 + Time + Length + Day")
  .setFeaturesCol("features")
  .setLabelCol("label")

formula.fit(data).transform(data)

Note: the features column is a vector and the label column is a double:

root
 |-- features: vector (nullable = true)
 |-- label: double (nullable = false)

I define my MLP estimator as follows:

import org.apache.spark.ml.classification.MultilayerPerceptronClassifier

val layers = Array[Int](6, 5, 8, 1) // I suspect this is where it went wrong

val mlp = new MultilayerPerceptronClassifier()
  .setLayers(layers)
  .setBlockSize(128)
  .setSeed(1234L)
  .setMaxIter(100)

// train the model
val model = mlp.fit(train)

Unfortunately, I got the following error:

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties

Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 3.0 failed 1 times, most recent failure: Lost task 0.0 in stage 3.0 (TID 3, localhost, executor driver): java.lang.ArrayIndexOutOfBoundsException: 11
    at org.apache.spark.ml.classification.LabelConverter$.encodeLabeledPoint(MultilayerPerceptronClassifier.scala:121)
    at org.apache.spark.ml.classification.MultilayerPerceptronClassifier$$anonfun$3.apply(MultilayerPerceptronClassifier.scala:245)
    at org.apache.spark.ml.classification.MultilayerPerceptronClassifier$$anonfun$3.apply(MultilayerPerceptronClassifier.scala:245)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:363)
    at scala.collection.Iterator$GroupedIterator.takeDestructively(Iterator.scala:935)
    at scala.collection.Iterator$GroupedIterator.go(Iterator.scala:950)
    ...


3 Answers

4 votes

org.apache.spark.ml.classification.LabelConverter$.encodeLabeledPoint(MultilayerPerceptronClassifier.scala:121)

This tells us that an array index is out of bounds in MultilayerPerceptronClassifier.scala. Let's look at the code there:

def encodeLabeledPoint(labeledPoint: LabeledPoint, labelCount: Int): (Vector, Vector) = {
  val output = Array.fill(labelCount)(0.0) // one slot per class
  output(labeledPoint.label.toInt) = 1.0   // throws if label >= labelCount
  (labeledPoint.features, Vectors.dense(output))
}

It performs a one-hot encoding of the labels in the dataset. The ArrayIndexOutOfBoundsException occurs since the output array is too short.
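To see the failure in isolation, here is the same index arithmetic in plain Scala, using the values from the question: one output node (so labelCount = 1) and a label of 11, which is the index the exception reports:

// Plain-Scala reproduction of the failing index arithmetic.
val labelCount = 1                       // the last element of Array(6, 5, 8, 1)
val output = Array.fill(labelCount)(0.0) // the one-hot array has a single slot
output(11) = 1.0                         // java.lang.ArrayIndexOutOfBoundsException: 11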

Going back through the code, you can find that labelCount is the same as the number of output nodes in the layers array. In other words, the number of output nodes must equal the number of classes. The documentation for MLP contains the following line:

The number of nodes N in the output layer corresponds to the number of classes.

The solution is therefore to either:

  1. Change the number of nodes in the final layer of the network (output nodes), as sketched below.

  2. Reconstruct the data to have the same number of classes as your network output nodes.

Note: The final output layer should always have 2 or more nodes, never 1, since there should be one node per class, and a problem with a single class does not make sense.
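For option 1, here is a minimal sketch of the fix. It assumes, as in the question, 6 input features and labels that are 0-based class indices (the exception reports index 11, so the data contains at least 12 classes); data and train are the asker's DataFrames:

// Size the output layer from the data instead of hard-coding it.
// Assumes labels are class indices 0, 1, ..., k-1.
val numClasses = data.select("label").distinct().count().toInt

val layers = Array[Int](6, 5, 8, numClasses) // 6 inputs, one output node per class

val mlp = new MultilayerPerceptronClassifier()
  .setLayers(layers)
  .setBlockSize(128)
  .setSeed(1234L)
  .setMaxIter(100)

val model = mlp.fit(train)

If the labels are not already contiguous indices starting at 0, re-index them first (for example with StringIndexer) before counting.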

0 votes

Rearrange your dataset: as the error shows, you have fewer array slots than values in your feature set, or your dataset contains nulls, which prompted the error. A sketch of the null cleanup is below. I came across this type of error while working on my MLP project; hope my answer helps you. Thanks for reaching out.
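If nulls are indeed the problem, a minimal sketch of the cleanup, using the column names from the question's formula (na.drop is the standard DataFrame method for this):

// Drop rows that have nulls in any column the formula uses.
val clean = data.na.drop(Seq(
  "vtplus15predict", "vhisttplus15", "vhistt", "vt",
  "vtminus15", "Time", "Length", "Day"))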

-2 votes

The solution is to first find a local optimum that escapes the ArrayIndexOutOfBoundsException, then use brute-force search to find the global optimum. Shaido suggests finding n:

For example, val layers = Array[Int](6, 5, 8, n). This assumes the length of the feature vectors is 6. – Shaido

So make n a large integer (n = 100), then manually brute-force search down to a good solution (n = 50, then try n = 32 - error, n = 35 - perfect). This search could also be scripted, as sketched below.

Credit to Shaido.
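For what it's worth, a sketch of automating that search, assuming train is the training DataFrame from the question; note that counting the distinct labels, as in the accepted answer, reaches the right value directly and is far cheaper:

import scala.util.Try

// Probe output-layer sizes until fit() stops throwing.
def fits(n: Int): Boolean = Try {
  new MultilayerPerceptronClassifier()
    .setLayers(Array[Int](6, 5, 8, n))
    .setMaxIter(1) // one iteration is enough to surface the encoding error
    .fit(train)
}.isSuccess

val minWorking = (2 to 100).find(fits) // smallest output size that trains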