Unable to train model with XGBoost - PySpark

Question

I'm attempting to train an XGBoost model on a Spark DataFrame that looks like this:

+--------------------+-------------------+
|            features|         TARGET_VAL|
+--------------------+-------------------+
|(122,[0,1,9,10,11...|                0.0|
|(122,[0,1,8,9,11,...| 14.577420000000002|
|[4.0,1.0,0.0,0.0,...|           65.44524|
|(122,[0,1,8,9,11,...|                0.0|
|(122,[0,1,8,9,10,...|           18.27017|
|(122,[0,1,8,11,12...|                0.0|
|(122,[0,1,8,10,11...|           75.75954|
|(122,[0,1,10,11,1...|           65.32013|
|[1.0,0.0,1.0,0.0,...|          171.16563|
|(122,[0,1,8,11,12...|                0.0|
|(122,[0,1,8,9,11,...|                0.0|
|(122,[0,1,8,10,11...|            2.27041|
|(122,[0,1,11,12,2...|                0.0|
|[4.0,1.0,0.0,0.0,...|           76.08024|
|(122,[0,1,8,9,11,...|                0.0|
|(122,[0,1,8,10,11...|           15.31895|
|(122,[0,1,8,10,11...|          122.56702|
|(122,[0,1,8,10,11...|-30.268179999999997|
|(122,[0,1,8,10,11...|                0.0|
|(122,[0,1,10,11,4...|          136.80025|
+--------------------+-------------------+

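I'm using sparkxgb (XGBoost with PySpark) and I'm setting up the model like this:

from sparkxgb import XGBoostClassifier

# eta is the learning rate; subsample is the fraction of rows sampled per tree
paramMap = {'eta': 0.1, 'subsample': 0.8}

xgbClassifier = XGBoostClassifier(**paramMap) \
    .setFeaturesCol("features") \
    .setLabelCol("TARGET_VAL")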

When I train the model with:

xgboostModel = xgbClassifier.fit(df)

I get the following error:

java.lang.IllegalArgumentException: requirement failed: Classifier found max label value = 23470.00821 but requires integers in range [0, ... 2147483647)

So I cast the TARGET_VAL column to int, and after doing that I get a different error:

java.lang.IllegalArgumentException: requirement failed: Classifier inferred 23471 from label values in column XGBoostClassifier_37d67e9f2233__labelCol, but this exceeded the max numClasses (100) allowed to be inferred from values.  To avoid this error for labels with > 100 classes, specify numClasses explicitly in the metadata; this can be done by applying StringIndexer to the label column.

I'm new to XGBoost and machine learning. I think TARGET_VAL is the column that the trained model will predict for a test dataset, and it's supposed to be a floating-point value. So, what am I doing wrong? Do I need to configure the model with different parameters?

pissall · Accepted Answer · 2019-10-17T17:59:46

The issue here is that TARGET_VAL is a continuous variable column, while XGBoostClassifier expects a discrete/categorical label column. There are far too many classes for the classifier. As you can see in the second error, the maximum numClasses that can be inferred is 100, and the classifier inferred 23471 classes from your label values.
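To make the two error messages more concrete, here is a simplified pure-Python sketch of the kind of label checks they describe. It is only an illustration, not Spark's actual implementation; the function name infer_num_classes and the constant are made up for this example, and the limit of 100 is taken from the error text in the question.

```python
MAX_INFERRED_CLASSES = 100  # limit quoted in the second error message


def infer_num_classes(labels, max_classes=MAX_INFERRED_CLASSES):
    """Simplified illustration of the label checks described by the two errors."""
    max_label = max(labels)
    # First error: the largest label must be a non-negative integer value.
    if max_label < 0 or not float(max_label).is_integer():
        raise ValueError(
            f"found max label value = {max_label} but requires non-negative integers"
        )
    # Labels are treated as class indices 0..max, so the class count is max + 1.
    num_classes = int(max_label) + 1
    if num_classes > max_classes:
        raise ValueError(
            f"inferred {num_classes} classes, exceeding the limit of {max_classes}"
        )
    return num_classes


print(infer_num_classes([0, 1, 2, 1]))  # 3
```

With a continuous target, the first check fails on the raw floats, and after casting to int the second check fails because the largest value plus one is treated as the number of classes.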

You are using a classification algorithm for a regression problem.
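A minimal sketch of what switching to a regressor could look like, assuming the sparkxgb wrapper you are using also exposes an XGBoostRegressor with the same setters as the classifier (worth checking against your installed version). The helper name train_regressor is made up for this example; the parameters and column names are the ones from the question.

```python
def train_regressor(df, features_col="features", label_col="TARGET_VAL"):
    """Fit a regression model on a continuous label column."""
    # Imported here so the sketch can be loaded without Spark installed;
    # XGBoostRegressor is assumed to be provided by the sparkxgb wrapper.
    from sparkxgb import XGBoostRegressor

    param_map = {"eta": 0.1, "subsample": 0.8}  # same parameters as in the question
    regressor = (
        XGBoostRegressor(**param_map)
        .setFeaturesCol(features_col)
        .setLabelCol(label_col)
    )
    # The label column stays a floating-point value; no cast to int is needed.
    return regressor.fit(df)
```

You would then call train_regressor(df) on the original DataFrame, leaving TARGET_VAL as a double.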

Continuous vs Discrete Variables - Wiki
