Spark ML: Issue in training after using ChiSqSelector for feature selection

Question

I'm new to Spark. I am working on a classification model and want to use ChiSqSelector to choose the most important features for training. However, when I train on the features selected by ChiSqSelector, it throws the following error:

"IllegalArgumentException: u'Feature 0 is marked as Nominal (categorical), but it does not have the number of values specified.'"

Interestingly, I get this error with every tree-based algorithm I tried. With Naive Bayes and logistic regression, I don't get the error.

I see the same behavior with the sample data from the Spark documentation. The error can be reproduced with the following code, adapted from the Spark 2.1.1 documentation:

from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.feature import ChiSqSelector
from pyspark.ml.linalg import Vectors
from pyspark.sql import SparkSession

# Reuses the existing session when run inside the pyspark shell
spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([
    (7, Vectors.dense([0.0, 0.0, 18.0, 1.0]), 1.0),
    (8, Vectors.dense([0.0, 1.0, 12.0, 0.0]), 0.0),
    (9, Vectors.dense([1.0, 0.0, 15.0, 0.1]), 0.0)],
    ["id", "features", "clicked"])

# Keep the top 2 features according to the chi-squared test
selector = ChiSqSelector(numTopFeatures=2, featuresCol="features",
                         outputCol="selectedFeatures", labelCol="clicked")
result = selector.fit(df).transform(df)
print("ChiSqSelector output with top %d features selected"
      % selector.getNumTopFeatures())
result.show()

# Training a tree-based model on the selected features raises the error
dt = DecisionTreeClassifier(labelCol="clicked", featuresCol="selectedFeatures")
model = dt.fit(result)

Someone reported the same problem on the Apache Spark User List (link below), but nobody responded. http://apache-spark-user-list.1001560.n3.nabble.com/Application-of-ChiSqSelector-results-in-quot-Feature-0-is-marked-as-Nominal-quot-td27040.html

I would highly appreciate it if someone could shed some light on this. Thanks in advance.

hello hello · Accepted Answer · 2018-07-03T08:32:48

I ran into this problem too. Converting the feature column from SparseVector to DenseVector made it run. I don't know if there is a better way to do it.
