0
votes

I am relatively new to the data mining area and have been experimenting with Weka.

I have a dataset which consists of almost 8000 records related to customers and items they have purchased. 58% of this data set has missing values for the "Gender" attribute.

I want to find the missing gender values based on the other data I do have.

I first thought I could do this using a classifier algorithm in Weka using a training set to build a model. Based on examples I saw online, I tried this with pretty much all the available algorithms available in Weka using a training set that consisted of 60-80% of the data which did not have missing values. This gave me a lower accuracy rate than I wanted (80-86% depending on the algorithm used)

Did I go about this correctly? Is there a way to improve this accuracy? I experimented with using different attributes, different pre-processing of the data etc.

I also tried using the ReplaceMissingValues filter on the complete dataset to see how that would handle the missing values. However, it just changed all the missing values to "Female" which obviously cannot be the case. So I'm wondering also wondering if I need to use this filter in my situation or not.

1

1 Answers

2
votes

It sounds like you went about it in the correct way. The ReplaceMissingValues filter replaces the missing values with the most frequent of the non-missing values I think, so it is not what you want in this case.

A better way to get an idea of the true accuracy of your gender-predictor would be to use cross-validation instead of the training/test split (Weka has a separate option for that). 80-86% may seem low, but keep in mind that random guessing will only get you about 50%, so it's still a lot better than that. To try to get better performance, pick a classifier that performs well and then play with its parameters until you get better performance. This is likely to be quite labour-intensive (although you could of course use automated methods for tuning, see e.g. Auto-WEKA), but the only way to improve the performance.

You can also combine the algorithm you choose with a separate feature selection step (Weka has a special meta-classifier for this). This may improve performance, but again you'll have to experiment to find the particular configuration that works for you.