Sklearn Voting ensemble with models using different features and testing with k-fold cross validation

Question

I have a data frame with 4 different groups of features.

I need to create 4 different models with these four different feature groups and combine them with the ensemble voting classifier. Furthermore, I need to test the classifier using k-fold cross validation.

However, I am finding it difficult to combine different feature sets, a voting classifier and k-fold cross validation with the functionality available in sklearn. The following is the code that I have so far.

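from sklearn import preprocessing, svm
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

y = df1.index
x = preprocessing.scale(df1)

SVM = svm.SVC(kernel='rbf', C=1)
rf = RandomForestClassifier(n_estimators=200)
ann = MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(25, 2), random_state=1)
neigh = KNeighborsClassifier(n_neighbors=10)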

models = list()
models.append(('facial', SVM))
models.append(('posture', rf))
models.append(('computer', ann))
models.append(('physio', neigh))

ens = VotingClassifier(estimators=models)

cv = KFold(n_splits=10, random_state=None, shuffle=True)
scores = cross_val_score(ens, x, y, cv=cv, scoring='accuracy')

As you can see, this program uses the same features for all 4 models. How can I improve this program to achieve my objective?

This works fine, but my objective is to use different groups of features for each model. Here all models use all the features available in my dataset. — Chamila Wijayarathna
This might be helpful stackoverflow.com/questions/45074579/… — Parthasarathy Subburaj
I already referred to this; however, the answers posted there do not use k-fold cross validation — Chamila Wijayarathna
You need to add a column selection step before each estimator. See the example here. Your final VotingClassifier will then have a list of pipelines (one for each column selector and estimator). Try to implement this approach. If you are still not able to solve it, I will post an answer. — Vivek Kumar

Chamila Wijayarathna · Accepted Answer · 2020-05-29T06:35:38

I managed to achieve this using Pipelines:

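from sklearn import svm
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import VotingClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

y = df1.index
# Keep the DataFrame: ColumnTransformer selects columns by name, and scaling happens inside each pipeline
x = df1

# Each feature group gets its own imputing and scaling step
phy_features = ['A', 'B', 'C']
phy_transformer = Pipeline(steps=[('imputer', SimpleImputer(strategy='median')), ('scaler', StandardScaler())])
phy_processer = ColumnTransformer(transformers=[('phy', phy_transformer, phy_features)])

fa_features = ['D', 'E', 'F']
fa_transformer = Pipeline(steps=[('imputer', SimpleImputer(strategy='median')), ('scaler', StandardScaler())])
fa_processer = ColumnTransformer(transformers=[('fa', fa_transformer, fa_features)])

# One pipeline per feature group: select the columns, then classify
pipe_phy = Pipeline(steps=[('preprocessor', phy_processer), ('classifier', svm.SVC(kernel='rbf', C=1))])
pipe_fa = Pipeline(steps=[('preprocessor', fa_processer), ('classifier', svm.SVC(kernel='rbf', C=1))])

# VotingClassifier expects a list of (name, estimator) tuples
ens = VotingClassifier(estimators=[('phy', pipe_phy), ('fa', pipe_fa)])

cv = KFold(n_splits=10, random_state=None, shuffle=True)
for train_index, test_index in cv.split(x):
    # Positional row selection on the DataFrame requires .iloc
    x_train, x_test = x.iloc[train_index], x.iloc[test_index]
    y_train, y_test = y[train_index], y[test_index]
    ens.fit(x_train, y_train)
    print(ens.score(x_test, y_test))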

If you receive a TypeError when using ColumnTransformer, please refer to "sklearn Pipeline: argument of type 'ColumnTransformer' is not iterable".
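
Since the question used cross_val_score, the manual loop can probably be replaced by a single call, because each pipeline does its own column selection. The following is only a minimal sketch under assumptions: the synthetic DataFrame, the labels, the make_group_pipeline helper, the choice of classifiers and the 5 folds are all hypothetical stand-ins, not part of my actual setup.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for df1 (assumption): six numeric columns and a binary label
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(200, 6)), columns=list("ABCDEF"))
labels = (df["A"] + df["D"] > 0).astype(int)


def make_group_pipeline(columns, classifier):
    """Hypothetical helper: select one feature group, impute and scale it, then classify."""
    numeric = Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ])
    # remainder defaults to 'drop', so only the listed columns reach the classifier
    selector = ColumnTransformer(transformers=[("group", numeric, columns)])
    return Pipeline(steps=[("preprocessor", selector), ("classifier", classifier)])


# Each ensemble member sees only its own feature group
ens = VotingClassifier(estimators=[
    ("phy", make_group_pipeline(["A", "B", "C"], SVC(kernel="rbf", C=1))),
    ("fa", make_group_pipeline(["D", "E", "F"],
                               RandomForestClassifier(n_estimators=50, random_state=0))),
])

# Pass the DataFrame itself so the column names survive the train/test splits
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(ens, df, labels, cv=cv, scoring="accuracy")
print(scores.mean())
```

The imputer and scaler are refit on the training part of every fold, so no statistics from the test rows leak into preprocessing, which is not the case when the whole dataset is scaled up front.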
