
Consider a big dataset with 2 billion+ samples and approximately 100+ features per sample. Of these features, about 10% are numerical/continuous variables and the rest are categorical (position, language, URL, etc.). Let's use some examples:

  • e.g.: dummy categorical feature

    feature: Position

        real values:     SUD | CENTRE | NORTH
        encoded values:   1  |    2   |   3  
    

...it would make sense to use a reduction like SVD here, because the distance between SUD and NORTH is greater than the distance between SUD and CENTRE; moreover, it is possible to encode this variable (e.g. with OneHotEncoder or StringIndexer) because of the small cardinality of its value set.
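
As a minimal sketch, assuming a SparkSession named spark (e.g. in spark-shell), the index encoding above is roughly what StringIndexer produces, except that its indices start at 0 and are ordered by label frequency:

    import org.apache.spark.ml.feature.StringIndexer

    // Toy DataFrame with the Position feature; column names are illustrative.
    val df = spark.createDataFrame(Seq(
      (0, "SUD"), (1, "CENTRE"), (2, "NORTH"), (3, "SUD")
    )).toDF("id", "position")

    // StringIndexer assigns one numeric index per label (most frequent -> 0.0).
    val indexed = new StringIndexer()
      .setInputCol("position")
      .setOutputCol("positionIndex")
      .fit(df)
      .transform(df)

    indexed.show()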

  • e.g.: real categorical feature

    feature: url

        real values:     very high cardinality
        encoded values:  ????? 
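
One option I have seen for cardinality this high is the hashing trick, which maps each value into a fixed-size vector without ever materializing a dictionary of all values. A minimal sketch of what that would look like with FeatureHasher (available from Spark 2.3, assuming a SparkSession named spark; on older versions HashingTF over tokens plays a similar role):

    import org.apache.spark.ml.feature.FeatureHasher

    // Toy data: the url column stands in for a very-high-cardinality categorical.
    val urls = spark.createDataFrame(Seq(
      (0, "http://example.com/a"),
      (1, "http://example.com/b"),
      (2, "http://other.org/x")
    )).toDF("id", "url")

    // Hash each url into one of numFeatures buckets; collisions are the price
    // paid for the fixed vector size.
    val hashed = new FeatureHasher()
      .setInputCols("url")
      .setOutputCol("urlVec")
      .setNumFeatures(1 << 18) // 262,144 buckets; tune to the real cardinality
      .transform(urls)

    hashed.show(truncate = false)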
    

1) In MLlib, about 90% of the models work only with numerical values (apart from the Frequent Itemset and DecisionTree techniques)

2) Feature transformers/reducers/extractors such as PCA or SVD are not well suited to this kind of data, and there is no implementation of (e.g.) MCA

a) What would your approach be to tackle this kind of data in Spark, or using MLlib?

b) Do you have any suggestions for coping with this many categorical values?

c) After reading a lot of the literature, and surveying the models implemented in Spark, my idea is that, for inferring one of those features from the other (categorical) ones, the models at point 1 could be the best choice. What do you think?

(To standardize on a classical use case, you can imagine the problem of inferring the gender of a person from visited URLs and other categorical features.)

Given that I am a newbie with regard to MLlib, may I ask you to provide a concrete example? Thanks in advance.
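
To make the use case above concrete, here is a minimal sketch of what I have in mind, using DecisionTreeClassifier on indexed categorical columns (the tiny dataset and column names are invented, and a SparkSession named spark is assumed):

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.DecisionTreeClassifier
    import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}

    // Invented toy data: predict gender from two categorical features.
    val people = spark.createDataFrame(Seq(
      ("news.example",  "it", "M"),
      ("shop.example",  "en", "F"),
      ("sport.example", "it", "M"),
      ("shop.example",  "fr", "F")
    )).toDF("url", "language", "gender")

    // Index every categorical column, including the label.
    val urlIdx   = new StringIndexer().setInputCol("url").setOutputCol("urlIdx")
    val langIdx  = new StringIndexer().setInputCol("language").setOutputCol("langIdx")
    val labelIdx = new StringIndexer().setInputCol("gender").setOutputCol("label")

    val assembler = new VectorAssembler()
      .setInputCols(Array("urlIdx", "langIdx"))
      .setOutputCol("features")

    // Trees can split on indexed categories directly (StringIndexer attaches
    // the nominal metadata the tree algorithm reads), so no one-hot is needed.
    val dt = new DecisionTreeClassifier()
      .setLabelCol("label")
      .setFeaturesCol("features")

    val model = new Pipeline()
      .setStages(Array(urlIdx, langIdx, labelIdx, assembler, dt))
      .fit(people)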

Take a look at ML also, maybe there is something useful. – Alberto Bonsanto

1 Answer


Well, first I would say Stack Overflow works in a different way: you should be the one providing a working example of the problem you are facing, and we help you out using that example.

Anyway, I was intrigued by the use of categorical values like the one you show as position. If this is, as you mention, a categorical variable with 3 levels (SUD, CENTRE, NORTH), then there is no distance between them if they are truly categorical. In that case I would create dummy variables like these:

       SUD_Cat  CENTRE_Cat  NORTH_Cat 
SUD      1         0           0
CENTRE   0         1           0
NORTH    0         0           1

This is a true dummy representation of a categorical variable.
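
In Spark ML terms, this full three-column matrix is one-hot encoding with every category kept. A minimal sketch, assuming a SparkSession named spark; note that OneHotEncoder drops the last category by default, so dropLast must be disabled to reproduce the table above (written for Spark 3.x, where the encoder is fit first; older versions transform directly):

    import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}

    val pos = spark.createDataFrame(Seq(
      (0, "SUD"), (1, "CENTRE"), (2, "NORTH")
    )).toDF("id", "position")

    val indexed = new StringIndexer()
      .setInputCol("position").setOutputCol("posIdx")
      .fit(pos).transform(pos)

    // dropLast = false keeps all three indicator columns, as in the table above.
    val dummies = new OneHotEncoder()
      .setInputCol("posIdx").setOutputCol("posVec")
      .setDropLast(false)
      .fit(indexed)
      .transform(indexed)

    dummies.show()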

On the other hand, if you want to take that distance into account, then you have to create another feature that encodes the distance explicitly; but that is no longer a dummy representation.
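
For instance, a hand-made ordinal feature that places CENTRE between SUD and NORTH could look like the following; the 1/2/3 mapping is invented for illustration:

    import org.apache.spark.sql.functions.{col, when}

    val pos = spark.createDataFrame(Seq(
      (0, "SUD"), (1, "CENTRE"), (2, "NORTH")
    )).toDF("id", "position")

    // Explicit ordinal encoding: SUD < CENTRE < NORTH, so the SUD-to-NORTH
    // distance (2.0) is larger than the SUD-to-CENTRE distance (1.0).
    val withOrder = pos.withColumn("positionOrd",
      when(col("position") === "SUD", 1.0)
        .when(col("position") === "CENTRE", 2.0)
        .otherwise(3.0))

    withOrder.show()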

If the problem you are facing is that, after you write your categorical features as dummy variables (note that all of them are now numerical), you have very many features and want to reduce your feature space, then that is a different problem.

As a rule of thumb, I try to use the entire feature space first; this is now a plus, since in Spark the computing power allows you to run modelling tasks with big datasets. If it is too big, then I would go for dimensionality reduction techniques, PCA, etc.
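
As a minimal sketch of that last step, here is Spark ML's PCA applied to toy vectors standing in for already-assembled dummy and numeric features (the data and k are invented for illustration; a SparkSession named spark is assumed):

    import org.apache.spark.ml.feature.PCA
    import org.apache.spark.ml.linalg.Vectors

    // Toy vectors standing in for already-assembled dummy + numeric features.
    val data = spark.createDataFrame(Seq(
      Tuple1(Vectors.dense(1.0, 0.0, 0.0, 5.2)),
      Tuple1(Vectors.dense(0.0, 1.0, 0.0, 3.1)),
      Tuple1(Vectors.dense(0.0, 0.0, 1.0, 7.4))
    )).toDF("features")

    // Project the feature space down to k = 2 principal components.
    val pca = new PCA()
      .setInputCol("features")
      .setOutputCol("pcaFeatures")
      .setK(2)
      .fit(data)

    pca.transform(data).select("pcaFeatures").show(truncate = false)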