I'm trying to train an XGBoost model on a Spark DataFrame that looks like this:
+--------------------+-------------------+
| features| TARGET_VAL|
+--------------------+-------------------+
|(122,[0,1,9,10,11...| 0.0|
|(122,[0,1,8,9,11,...| 14.577420000000002|
|[4.0,1.0,0.0,0.0,...| 65.44524|
|(122,[0,1,8,9,11,...| 0.0|
|(122,[0,1,8,9,10,...| 18.27017|
|(122,[0,1,8,11,12...| 0.0|
|(122,[0,1,8,10,11...| 75.75954|
|(122,[0,1,10,11,1...| 65.32013|
|[1.0,0.0,1.0,0.0,...| 171.16563|
|(122,[0,1,8,11,12...| 0.0|
|(122,[0,1,8,9,11,...| 0.0|
|(122,[0,1,8,10,11...| 2.27041|
|(122,[0,1,11,12,2...| 0.0|
|[4.0,1.0,0.0,0.0,...| 76.08024|
|(122,[0,1,8,9,11,...| 0.0|
|(122,[0,1,8,10,11...| 15.31895|
|(122,[0,1,8,10,11...| 122.56702|
|(122,[0,1,8,10,11...|-30.268179999999997|
|(122,[0,1,8,10,11...| 0.0|
|(122,[0,1,10,11,4...| 136.80025|
+--------------------+-------------------+
I'm using sparkxgb (XGBoost with PySpark) and setting up the model like this:
paramMap = {'eta': 0.1, 'subsample': 0.8}
xgbClassifier = XGBoostClassifier(**paramMap) \
    .setFeaturesCol("features") \
    .setLabelCol("TARGET_VAL")
When I fit the model with:
xgboostModel = xgbClassifier.fit(df)
I get the following error:
java.lang.IllegalArgumentException: requirement failed: Classifier found max label value = 23470.00821 but requires integers in range [0, ... 2147483647)
So I cast the TARGET_VAL column to int, and after that I get a different error:
java.lang.IllegalArgumentException: requirement failed: Classifier inferred 23471 from label values in column XGBoostClassifier_37d67e9f2233__labelCol, but this exceeded the max numClasses (100) allowed to be inferred from values. To avoid this error for labels with > 100 classes, specify numClasses explicitly in the metadata; this can be done by applying StringIndexer to the label column.
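As far as I can tell, the 23471 in the second message is just the largest label after the cast (23470) plus one, as if every integer from 0 up to the maximum were treated as its own class. The plain-Python sketch below is only my reading of the two messages, not the actual Spark or sparkxgb code. The function name infer_num_classes and its checks are hypothetical, and the limit of 100 is taken from the error text:

```python
import math

MAX_INFERRED_CLASSES = 100  # limit quoted in the second error message


def infer_num_classes(labels, max_classes=MAX_INFERRED_CLASSES):
    """Hypothetical re-creation of the label checks the two errors describe."""
    max_label = max(labels)
    # First error: the largest label must be a non-negative integer value.
    if max_label < 0 or max_label != math.floor(max_label):
        raise ValueError(
            f"found max label value = {max_label} but requires integers"
        )
    # Labels 0..max_label are each assumed to be a class, hence the +1.
    num_classes = int(max_label) + 1
    # Second error: too many classes to infer from the values alone.
    if num_classes > max_classes:
        raise ValueError(
            f"inferred {num_classes} classes, exceeding the max ({max_classes})"
        )
    return num_classes


# Float labels, the same labels cast to int, and a small genuine class set.
for labels in ([0.0, 14.57742, 23470.00821], [0, 14, 23470], [0, 1, 2]):
    try:
        print(labels, "->", infer_num_classes(labels))
    except ValueError as err:
        print(labels, "->", err)
```

With values like mine, the float labels fail the integer check and the cast labels produce 23471 classes, which matches the numbers in both messages. Only the last list, which really is a small set of class ids, passes.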
I'm new to XGBoost and machine learning. As I understand it, TARGET_VAL is the column the trained model should predict for a test dataset, and it is supposed to be a floating-point value. So what am I doing wrong? Do I need to configure the model with different parameters?