I have a dataset with 41 features [columns 0 to 40], of which 7 are categorical. This categorical set is divided into two subsets:
- A subset of string type (the column-features 1, 2, 3)
- A subset of int type, in binary form 0 or 1 (the column-features 6, 11, 20, 21)
Furthermore, the column-features 1, 2 and 3 (of string type) have cardinality 3, 66 and 11 respectively. In this context I have to encode them in order to use a support vector machine algorithm. This is the code that I have:
import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn import feature_extraction

df = pd.read_csv("train.csv")
datanumpy = df.values          # .values instead of the deprecated .as_matrix()
X = datanumpy[:, 0:41]         # columns 0 to 40: the 41 features
y = datanumpy[:, 41]           # column 41: the labels
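For context, the layout I described above can be checked with something like this (I'm indexing by position, so the exact column names don't matter):

print(df.dtypes)                          # columns 1, 2, 3 should show up as object (string)
print(df.iloc[:, [1, 2, 3]].nunique())    # should give the cardinalities 3, 66, 11 mentioned above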
I don't know whether it is better to use DictVectorizer() or OneHotEncoder() [for the reasons I exposed above], and mainly how to use them [in terms of code] with the X matrix that I have.
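For instance, is something like this the right way to apply OneHotEncoder to my X? This is only a sketch of what I have in mind; it assumes scikit-learn >= 0.20 (so that the encoder accepts string columns directly) and that the string features sit at positions 1, 2 and 3 of X, as above:

import numpy as np
from sklearn.preprocessing import OneHotEncoder

string_cols = X[:, [1, 2, 3]]                      # the three string-type column-features
ohe = OneHotEncoder(handle_unknown="ignore")       # one dummy column per distinct category
strings_encoded = ohe.fit_transform(string_cols).toarray()

other_cols = np.delete(X, [1, 2, 3], axis=1).astype(float)   # the remaining 38 numeric features
X_encoded = np.hstack([other_cols, strings_encoded])         # final matrix to feed to the SVM

The DictVectorizer route would instead start from df[df.columns[[1, 2, 3]]].to_dict(orient="records") and call fit_transform on that list of dicts; as far as I understand, both produce essentially the same 0/1 dummy columns, so I'm not sure the choice matters much here.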
Or should I simply map each distinct string value to an integer (since those features have high cardinality, one-hot encoding would blow up my feature space)?
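If that integer mapping is the way to go, I suppose it would look something like this (again just a sketch; OrdinalEncoder needs scikit-learn >= 0.20, otherwise I would apply LabelEncoder column by column):

from sklearn.preprocessing import OrdinalEncoder

oe = OrdinalEncoder()                              # maps each distinct string to 0..k-1, per column
X[:, [1, 2, 3]] = oe.fit_transform(X[:, [1, 2, 3]])
X = X.astype(float)                                # now the whole matrix is numeric

My worry with this is that it invents an ordering between categories that doesn't really exist, and the SVM would treat those integers as real distances.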
EDIT: with respect to the subset of int type, I guess the best choice is to keep those column-features as they are (i.e. not pass them through any encoder). The problem persists for the subset of string type with high cardinality.
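Putting the pieces together, what I'm currently leaning towards is something like the sketch below, which one-hot encodes only the three string columns and passes everything else (the binary 0/1 features included) through untouched. It assumes scikit-learn >= 0.20 for ColumnTransformer, and it works from the DataFrame rather than the numpy matrix so the numeric dtypes are preserved:

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.svm import SVC

X_df = df.iloc[:, 0:41]        # the 41 feature columns (0 to 40)
y = df.iloc[:, 41]             # the label column

pre = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), [1, 2, 3])],
    remainder="passthrough",   # binary 0/1 and other numeric features are left as they are
)

clf = Pipeline([("pre", pre), ("svm", SVC(kernel="rbf"))])
clf.fit(X_df, y)

The remainder="passthrough" part is what keeps the int-type features exactly as they are, as per my EDIT above. Is this a reasonable way to handle the high-cardinality string columns, or is the plain integer mapping acceptable for an SVM?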