4
votes

I have a dataset with 41 features (columns 0 to 40), of which 7 are categorical. The categorical features fall into two subsets:

  • A subset of string type (column-features 1, 2, 3)
  • A subset of int type, in binary form 0 or 1 (column-features 6, 11, 20, 21)

Furthermore, column-features 1, 2 and 3 (of string type) have cardinalities 3, 66 and 11 respectively. In this context I have to encode them in order to use a support vector machine algorithm. This is the code that I have:

import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn import feature_extraction

df = pd.read_csv("train.csv")
datanumpy = df.to_numpy()  # as_matrix() was removed in pandas 1.0
X = datanumpy[:, 0:41]  # select columns 0 through 40 (the 41 features)
y = datanumpy[:, 41]  # select column 41 (the labels)

I don't know whether it is better to use DictVectorizer() or OneHotEncoder() (for the reasons given above), and above all how to use them, in terms of code, with the X matrix that I have. Or should I simply assign a number to each distinct value in the string-type subset (since they have high cardinality, so one-hot encoding would enlarge my feature space considerably)?

EDIT: For the int-type subset, I guess the best choice is to keep the column-features as they are (not pass them to any encoder). The problem persists for the string-type subset with high cardinality.

3 Answers

3
votes

This is by far the easiest:

df = pd.get_dummies(df, drop_first=True)

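If you run out of memory or it is too slow, reduce the cardinality before encoding: keep only the most frequent values of a high-cardinality column col (here the top 10) and replace all the others with "other":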

top = df[col].isin(df[col].value_counts().index[:10])
df.loc[~top, col] = "other"
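
As a minimal sketch of the two snippets working together, here they are on a toy frame. The column names and values are made up for illustration, and the cutoff of 2 stands in for the 10 used above.

```python
import pandas as pd

# Toy data: "service" is a hypothetical high-cardinality string column.
df = pd.DataFrame({
    "service": ["http", "http", "ftp", "smtp", "http", "ftp", "dns", "ssh"],
    "duration": [1, 2, 3, 4, 5, 6, 7, 8],
})

col = "service"
n_keep = 2  # keep the 2 most frequent values; the snippet above keeps 10

# value_counts() sorts by frequency, so the first n_keep index entries
# are the most common categories.
top = df[col].isin(df[col].value_counts().index[:n_keep])
df.loc[~top, col] = "other"

# One-hot encode the string columns; drop_first removes one redundant
# indicator column per encoded feature.
encoded = pd.get_dummies(df, drop_first=True)
print(sorted(encoded.columns))
```

The reduction has to run before get_dummies, since it works on the original string column.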

1
votes

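As I understand the OneHotEncoder documentation, the encoder should be fitted on data that contains every category appearing in both the train and test sets, for example the combined dataset. Otherwise it may meet categories at transform time that it never saw during fitting, and the encoding will not be consistent.

Performance-wise, I think OneHotEncoder is much better than DictVectorizer.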
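
An alternative to fitting on the combined data is to fit on the training set only and let the encoder ignore categories it has not seen. The sketch below uses made-up values and assumes scikit-learn 0.20 or later, where OneHotEncoder accepts string input directly.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical single categorical column, split into train and test.
train = np.array([["tcp"], ["udp"], ["tcp"]], dtype=object)
test = np.array([["udp"], ["icmp"]], dtype=object)  # "icmp" is not in train

# handle_unknown="ignore" encodes an unseen category as an all-zero row
# instead of raising an error at transform time.
enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(train)

train_encoded = enc.transform(train).toarray()
test_encoded = enc.transform(test).toarray()
print(enc.categories_)
print(test_encoded)
```

Whether an all-zero row is acceptable for unseen values depends on the model; fitting on all known categories avoids the question entirely.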

0
votes

You can use the pandas function get_dummies(), as @simon suggested above, or the scikit-learn equivalent, OneHotEncoder.

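I prefer OneHotEncoder because you can pass it parameters such as which categorical features to encode (categorical_features) and the number of values per feature (n_values; if not given, it is inferred automatically from the data). Note that both parameters belong to older scikit-learn releases: they were deprecated in version 0.20 and later removed, with ColumnTransformer and the categories parameter taking over their roles.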
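
In current scikit-learn, one way to encode only some columns is to wrap OneHotEncoder in a ColumnTransformer. The following is a sketch under assumptions: the column names, values and labels are all hypothetical, and the SVC is fitted with default settings only to show the pieces fitting together.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.svm import SVC

# Toy data; all column names and values are hypothetical.
X = pd.DataFrame({
    "duration": [0.5, 1.2, 3.3, 0.1],
    "protocol": ["tcp", "udp", "tcp", "icmp"],   # string-typed
    "service": ["http", "ftp", "http", "smtp"],  # string-typed
    "logged_in": [0, 1, 1, 0],                   # already binary
})
y = [0, 1, 0, 1]

# Only the string columns go through the encoder; everything else
# (numeric and 0/1 columns) is passed through unchanged.
string_cols = ["protocol", "service"]
encoder = ColumnTransformer(
    transformers=[("onehot", OneHotEncoder(handle_unknown="ignore"), string_cols)],
    remainder="passthrough",
)
X_encoded = encoder.fit_transform(X)
print(X_encoded.shape)

# The same transformer can sit in front of an SVM inside a pipeline.
model = make_pipeline(encoder, SVC())
model.fit(X, y)
predictions = model.predict(X)
```

Here the two string columns expand to 3 indicator columns each, and the two remaining columns are kept, giving 8 columns in total.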

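If the cardinality of some features is too large, limit the number of encoded values (n_values in older scikit-learn releases; newer releases offer max_categories and min_frequency for this purpose). If you have enough memory, don't worry and encode all the values of your features.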

I guess that for a cardinality of 66, even on a basic computer, encoding all 66 values won't cause a memory issue. Memory problems usually arise when a feature has, for example, as many distinct values as there are samples in the dataset (as with IDs that are unique per sample). The bigger the dataset, the more likely you are to hit a memory issue.
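
To make the difference concrete, here is a small sketch with synthetic data, comparing a 66-value feature against an ID-like feature with one distinct value per sample. The sample count of 1000 is an arbitrary assumption.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

n_samples = 1000  # arbitrary size for the illustration

# A feature with exactly 66 distinct values, and an ID-like feature
# with one distinct value per sample. Both are synthetic.
moderate = (np.arange(n_samples) % 66).reshape(-1, 1)
unique_ids = np.arange(n_samples).reshape(-1, 1)

enc_moderate = OneHotEncoder().fit_transform(moderate)
enc_ids = OneHotEncoder().fit_transform(unique_ids)

# The encoded width equals the number of distinct values in the column.
print(enc_moderate.shape)
print(enc_ids.shape)

# By default the result is a SciPy sparse matrix with one stored
# nonzero per row; a dense float64 copy would need rows * cols * 8 bytes.
dense_bytes_ids = enc_ids.shape[0] * enc_ids.shape[1] * 8
print(enc_ids.nnz, dense_bytes_ids)
```

The ID-like column produces as many columns as there are rows, so a dense copy grows quadratically with the dataset size, while the 66-value column stays at 66 columns regardless of the number of samples.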