
Using the ml package of Spark 2.0 (Python) and a 1.2-million-row dataset, I am trying to build a model that predicts purchase tendency with a Random Forest Classifier. However, when applying the transformation to the split test dataset, the prediction is always 0.

The dataset looks like:

[Row(tier_buyer=u'0', N1=u'1', N2=u'0.72', N3=u'35.0', N4=u'65.81', N5=u'30.67', N6=u'0.0'....

tier_buyer is the field used as the label (via a StringIndexer). The rest of the fields contain numeric data.

Steps

1.- Load the parquet file, and fill possible null values:

parquet = spark.read.parquet('path_to_parquet')
parquet.createOrReplaceTempView("parquet")
dfraw = spark.sql("SELECT * FROM parquet").dropDuplicates()
df = dfraw.na.fill(0)  # note: fill(0) only affects numeric columns, not string columns

2.- Create features vector:

from pyspark.ml.feature import VectorAssembler

features = VectorAssembler(
    inputCols = ['N1','N2'...],
    outputCol = 'features')

3.- Create string indexer:

from pyspark.ml.feature import StringIndexer

label_indexer = StringIndexer(inputCol = 'tier_buyer', outputCol = 'label')

4.- Split the train and test datasets:

(train, test) = df.randomSplit([0.7, 0.3])

Resulting train dataset:

[screenshot of the train DataFrame]

Resulting test dataset:

[screenshot of the test DataFrame]

5.- Define the classifier:

from pyspark.ml.classification import RandomForestClassifier

classifier = RandomForestClassifier(labelCol = 'label', featuresCol = 'features')

6.- Pipeline the stages and fit the train model:

from pyspark.ml import Pipeline

pipeline = Pipeline(stages=[features, label_indexer, classifier])
model = pipeline.fit(train)

7.- Transform the test dataset:

predictions = model.transform(test)

8.- Output the test result, grouped by prediction:

predictions.select("prediction", "label", "features").groupBy("prediction").count().show()

[screenshot of the grouped output: every row is predicted 0.0]

As you can see, the outcome is always 0. I have tried multiple feature variations in hopes of reducing the noise, and have also tried loading from different sources and inferring the schema, with no luck and the same results.

Questions

  • Is the current setup, as described above, correct?
  • Could the null-value filling on the original DataFrame be the source of the failure to predict effectively?
  • In the screenshots above, some features appear in the form of a tuple and others as a list. Why? I'm guessing this could be a possible source of error. (They are representations of dense and sparse vectors.)
I answered a similar question in a personal gist a while ago. You can take a look at it: gist.github.com/eliasah/8709e6391784be0feb7fe9dd31ae0c0a – eliasah
Thank you @eliasah, I will take a look at stratified sampling. Do you know why some of the feature results appear in the form of a tuple and others as a list? – TMichel
Those are just representations of dense and sparse vectors. – eliasah
What about predictions? Would you care to group by predictions after training? I also believe I have answered one of your questions about vector representation; it's not a source of error. One more question: does your data have duplicates? – eliasah

1 Answer


It seems your features [N1, N2, ...] are strings. You may want to cast all of your features to FloatType() or something along those lines. It may also be prudent to call fillna() after type casting, since casting unparseable strings produces nulls.