Consider a big dataset with 2 billion+ samples and approximately 100+ features per sample. About 10% of the features are numerical/continuous variables and the rest are categorical variables (position, language, url, etc.).
Let's look at some examples:
e.g. a dummy categorical feature:

feature: Position
real values: SOUTH | CENTRE | NORTH
encoded values: 1 | 2 | 3

Here it would make sense to use a reduction like SVD, because the distance between SOUTH and NORTH is greater than the distance between SOUTH and CENTRE; moreover, it is possible to encode this variable (e.g. with OneHotEncoder or StringIndexer) because of the small cardinality of its value set.
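For this low-cardinality case, here is a minimal sketch of the encoding I mean, assuming Spark 3.x (where OneHotEncoder is an estimator); the data and column names are made up:

```scala
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("position-encoding").getOrCreate()
import spark.implicits._

// Toy data for the low-cardinality Position feature
val df = Seq("SOUTH", "CENTRE", "NORTH", "SOUTH").toDF("position")

// Map the string categories to numeric indices (0.0, 1.0, 2.0)
val indexer = new StringIndexer()
  .setInputCol("position")
  .setOutputCol("positionIndex")

// Expand each index into a sparse 0/1 vector, so that no artificial
// ordering or distance is imposed between the categories
val encoder = new OneHotEncoder()
  .setInputCol("positionIndex")
  .setOutputCol("positionVec")

val indexed = indexer.fit(df).transform(df)
encoder.fit(indexed).transform(indexed).show(false)
```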
e.g. a real categorical feature:

feature: url
real values: very high cardinality
encoded values: ?????
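One idea I have considered for this high-cardinality case is the hashing trick, which keeps the dimensionality fixed no matter how many distinct urls appear; a minimal sketch, assuming Spark 2.3+ (where FeatureHasher is available) and made-up data and column names:

```scala
import org.apache.spark.ml.feature.FeatureHasher
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("url-hashing").getOrCreate()
import spark.implicits._

// Toy data for the high-cardinality url feature
val df = Seq(
  ("http://example.com/a", "en"),
  ("http://example.org/b", "it")
).toDF("url", "language")

// Hash each (column, value) pair into a fixed-size sparse vector,
// so the encoding never grows with the number of distinct urls
val hasher = new FeatureHasher()
  .setInputCols("url", "language")
  .setOutputCol("features")
  .setNumFeatures(1 << 18) // trades hash collisions against dimensionality

hasher.transform(df).show(false)
```

I am not sure, though, how well the downstream models cope with the collisions this introduces.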
1) In MLlib, about 90% of the models work only with numerical values (apart from the Frequent Itemset and DecisionTree techniques).
2) Feature transformers/reducers/extractors such as PCA or SVD are not well suited to this kind of data, and there is no implementation of (e.g.) MCA.
a) What would your approach be for tackling this kind of data in Spark, or using MLlib?
b) Do you have any suggestions for coping with this many categorical values?
c) After reading a lot of the literature, and surveying the models implemented in Spark, my idea is that, for inferring one of these features from the other (categorical) ones, the models at point 1 could be the best choice. What do you think about it?
(To standardize this as a classical use case, you can imagine the problem of inferring a person's gender from visited urls and other categorical features.)
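To make question c) concrete, here is roughly the pipeline I have in mind, with a DecisionTree on indexed categorical features; it is only a sketch with made-up data and column names, and I am not sure it is the right way:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.DecisionTreeClassifier
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("gender-inference").getOrCreate()
import spark.implicits._

// Made-up training data: categorical features plus the label to infer
val df = Seq(
  ("http://news.example.com", "it", "M"),
  ("http://shop.example.com", "en", "F"),
  ("http://sport.example.com", "it", "M")
).toDF("url", "language", "gender")

// Index every categorical column, including the label; StringIndexer
// attaches nominal metadata, so the tree treats the values as categories
val urlIdx = new StringIndexer().setInputCol("url").setOutputCol("urlIdx")
val langIdx = new StringIndexer().setInputCol("language").setOutputCol("langIdx")
val labelIdx = new StringIndexer().setInputCol("gender").setOutputCol("label")

val assembler = new VectorAssembler()
  .setInputCols(Array("urlIdx", "langIdx"))
  .setOutputCol("features")

// maxBins must be at least the cardinality of the largest categorical
// feature, which is exactly what worries me for real urls
val tree = new DecisionTreeClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")
  .setMaxBins(32)

val model = new Pipeline()
  .setStages(Array(urlIdx, langIdx, labelIdx, assembler, tree))
  .fit(df)

model.transform(df).select("url", "language", "gender", "prediction").show(false)
```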
Given that I am a newbie with regard to MLlib, may I ask you to provide a concrete example? Thanks in advance.