
Using the ml package of Spark 2.0 (Python) and a 1.2-million-row dataset, I am trying to build a model that predicts purchase tendency with a Random Forest Classifier. However, when applying the transformation to the split test dataset, the prediction is always 0.

The dataset looks like:

[Row(tier_buyer=u'0', N1=u'1', N2=u'0.72', N3=u'35.0', N4=u'65.81', N5=u'30.67', N6=u'0.0'....

tier_buyer is the field used as the label (via a StringIndexer). The rest of the fields contain numeric data.

Steps

1.- Load the parquet file, and fill possible null values:

parquet = spark.read.parquet('path_to_parquet')
parquet.createOrReplaceTempView("parquet")
dfraw = spark.sql("SELECT * FROM parquet").dropDuplicates()
df = dfraw.na.fill(0)  # note: fill(0) only affects numeric columns, not string columns

2.- Create features vector:

from pyspark.ml.feature import VectorAssembler

features = VectorAssembler(
    inputCols = ['N1','N2'...],
    outputCol = 'features')

3.- Create string indexer:

from pyspark.ml.feature import StringIndexer

label_indexer = StringIndexer(inputCol = 'tier_buyer', outputCol = 'label')

4.- Split the train and test datasets:

(train, test) = df.randomSplit([0.7, 0.3])

Resulting train dataset:

[screenshot of the train DataFrame]

Resulting test dataset:

[screenshot of the test DataFrame]

5.- Define the classifier:

from pyspark.ml.classification import RandomForestClassifier

classifier = RandomForestClassifier(labelCol = 'label', featuresCol = 'features')

6.- Pipeline the stages and fit the train model:

from pyspark.ml import Pipeline

pipeline = Pipeline(stages=[features, label_indexer, classifier])
model = pipeline.fit(train)

7.- Transform the test dataset:

predictions = model.transform(test)

8.- Output the test result, grouped by prediction:

predictions.select("prediction", "label", "features").groupBy("prediction").count().show()

[screenshot of the grouped output: every row is predicted 0.0]

As you can see, the outcome is always 0. I have tried multiple feature variations in hopes of reducing the noise, and have also tried loading from different sources and inferring the schema, with no luck and the same results.

Questions

  • Is the current setup, as described above, correct?
  • Could the null-value filling on the original DataFrame be the source of the failure to predict effectively?
  • In the screenshots above, some features appear in the form of a tuple and others as a list. Why? I'm guessing this could be a possible source of error. (They are representations of dense and sparse vectors.)
I answered a similar question in a personal gist a while ago. You can take a look at it: gist.github.com/eliasah/8709e6391784be0feb7fe9dd31ae0c0a – eliasah
Thank you @eliasah, I will take a look at stratified sampling. Do you know why some of the feature results appear in the form of a tuple and others as a list? – TMichel
Those are just representations of dense and sparse vectors. – eliasah
What about predictions? Would you care to group by predictions after training? I also believe I have answered one of your questions about vector representation; it's not a source of error. One more question: does your data have duplicates? – eliasah

1 Answer


It seems your features [N1, N2, ...] are strings. You may want to cast all of your features to FloatType() or something along those lines. It may also be prudent to call fillna() after type casting, since casting unparseable strings produces nulls.