
I am building a customized ensemble model and would like to do cross validation and grid search in Python using a pipeline. How can I do it?

I have a data set that contains web content. What I want to do is:

  1. Split the content from a single webpage into two parts. The reason for splitting is that the text comes from different places on the page and I want to handle the parts separately.

  2. Train model1 using only the features from part1, and train model2 using only the features from part2.

  3. Assume I get a score S1 from model1 and a score S2 from model2. I train another model, say a logistic regression model, to ensemble these two scores into a final score S.

Throughout this whole process, is there a way I can use an ML pipeline in sklearn to do the cross validation and grid search?


I appreciate Dev's reply below; however, when I tried to do the same thing I ran into new problems. I have the following code:

import pandas as pd
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import FunctionTransformer
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

df = pd.DataFrame(columns=['landingVector', 'contentVector', 'label'])

def extractLandingData(X):
    return X['landingVector']

def extractContentData(X):
    return X['contentVector']



svm_landing = Pipeline([
    ("extractLanding", FunctionTransformer(extractLandingData)),
    ("svmLanding", SVC(random_state=0, class_weight='balanced', kernel='linear', probability=True)),
])
svm_content = Pipeline([
    ("extractContent", FunctionTransformer(extractContentData)),
    ("svmContent", SVC(random_state=0, class_weight='balanced', kernel='linear', probability=True)),
])

stage_pipeline = FeatureUnion([
    ("svmForLanding", svm_landing),
    ("svmForContent", svm_content),
])

full_pipeline = Pipeline([
    ("stagePipeline", stage_pipeline),
    ("lr", LogisticRegression())
])

params = [
    {
        "stagePipeline__svmForLanding__svmLanding__C": [3, 5, 10],
        "lr__C": [1, 5, 10],
        "lr__penalty": ['l1', 'l2']
    }
]

grid_search = GridSearchCV(full_pipeline, params, cv=3, verbose=3, return_train_score=True, n_jobs=-1)
X_train = df[['landingVector', 'contentVector']]
y_train = df['label']
grid_search.fit(X_train, y_train)

Then I got this error message:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input> in <module>
     23 stage_pipeline = FeatureUnion([
     24     ("svmForLanding", svm_landing),
---> 25     ("svmForContent", svm_content),
     26 ])
     27

~/anaconda3/lib/python3.7/site-packages/sklearn/pipeline.py in __init__(self, transformer_list, n_jobs, transformer_weights)
    672         self.n_jobs = n_jobs
    673         self.transformer_weights = transformer_weights
--> 674         self._validate_transformers()
    675
    676     def get_params(self, deep=True):

~/anaconda3/lib/python3.7/site-packages/sklearn/pipeline.py in _validate_transformers(self)
    716                 raise TypeError("All estimators should implement fit and "
    717                                 "transform. '%s' (type %s) doesn't" %
--> 718                                 (t, type(t)))
    719
    720     def _iter(self):

TypeError: All estimators should implement fit and transform. 'Pipeline(memory=None, steps=[('extractLanding', FunctionTransformer(accept_sparse=False, check_inverse=True, func=<function extractLandingData at 0x...>, inv_kw_args=None, inverse_func=None, kw_args=None, pass_y='deprecated', validate=None)), ('svmLanding', SVC(C=1.0, cache_size=200...inear', max_iter=-1, probability=True, random_state=0, shrinking=True, tol=0.001, verbose=False))])' (type <class 'sklearn.pipeline.Pipeline'>) doesn't

take a look at vecstack – Shihab Shahriar Khan
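For reference, a rough sketch of the stacking workflow that comment points at, on synthetic stand-in data (vecstack's stacking helper produces the out-of-fold stage-1 predictions that feed the meta-model):

from vecstack import stacking
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# synthetic stand-in data
X, y = make_classification(n_samples=200, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = [SVC(random_state=0),
          SVC(kernel='linear', random_state=0)]

# out-of-fold predictions of the stage-1 models become the
# features for the stage-2 (meta) model
S_tr, S_te = stacking(models, X_tr, y_tr, X_te,
                      regression=False, n_folds=3)
meta = LogisticRegression().fit(S_tr, y_tr)
print(meta.score(S_te, y_te))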

2 Answers


Say you are dividing your ensemble into two stages:

  1. Stage 1 models, i.e. model1 and model2.

  2. A logistic regression model that is built upon the output of the stage 1 models.

So you can use GridSearchCV in the first stage to find the best parameters. GridSearchCV internally uses cross validation and has a parameter 'cv' for the number of folds, so the best parameters are selected across different folds of the data.

For the stage 2 model, i.e. the logistic regression, you don't really need GridSearchCV, but you can still use 'cross_val_score', which will calculate the score on different subsets of the data.
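A minimal sketch of that two-stage recipe, using synthetic stand-in data (X1 and X2 play the roles of the part-1 and part-2 feature sets):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

# stand-in data: the first 10 columns play the role of the part-1
# features and the rest the part-2 features
X, y = make_classification(n_samples=200, n_features=20, random_state=0)
X1, X2 = X[:, :10], X[:, 10:]

# stage 1: tune each SVM separately; cv sets the number of folds
gs1 = GridSearchCV(SVC(probability=True, random_state=0), {"C": [1, 5, 10]}, cv=3)
gs2 = GridSearchCV(SVC(probability=True, random_state=0), {"C": [1, 5, 10]}, cv=3)
gs1.fit(X1, y)
gs2.fit(X2, y)

# stage 2: stack the two scores S1 and S2 and evaluate the
# logistic regression with cross_val_score
S = np.column_stack([gs1.predict_proba(X1)[:, 1],
                     gs2.predict_proba(X2)[:, 1]])
print(cross_val_score(LogisticRegression(), S, y, cv=3))

Note that scoring the stage-1 models on their own training data leaks information into stage 2; out-of-fold predictions (as stacking libraries produce) avoid that.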


Yes, you can use GridSearchCV or RandomizedSearchCV to find the best hyperparameters for your pipeline model.

  • You can define your model as a combination of pipelines, sequentially or in parallel.
  • Then you can use the final pipeline in GridSearchCV.
  • In the grid_params you can reference a hyperparameter of each inner pipeline by concatenating the step names with a "__" double underscore.

Please look at the following example of a case similar to yours (MapTransformer, CountStemmedWord, WordCountsToVector, and LblBinarizer are custom transformers defined elsewhere). See how the pipelines are chained and how a hyperparameter of a pipeline item is referenced in grid_params.

email_body_to_wordcount = Pipeline([
    ("convert_to_text", MapTransformer(email_to_text)),
    ("strip_html", MapTransformer(strip_html)),
    ("replace_urls", MapTransformer(replace_urls)),
    ("replace_numbers", MapTransformer(replace_numbers)),
    ("replace_non_word_characters", MapTransformer(replace_non_word_characters)),
    ("count_word_stem", CountStemmedWord()),   
], memory="cache")

subject_to_wordcount = Pipeline([
    ("process_text", Pipeline([
        ("get_subject", MapTransformer(get_email_subject)),
        ("replace_numbers", MapTransformer(replace_numbers)),
        ("replace_non_word_characters", MapTransformer(replace_non_word_characters)),
    ], memory="cache")),
    ("count_word_stem", CountStemmedWord(importance=5)),
])

email_to_word_count = FeatureUnion([
    ("email_to_wordcount", email_body_to_wordcount),
    ("subject_to_wordcount", subject_to_wordcount)
])

content_type_pipeline = Pipeline([
    ("get_content_type", MapTransformer(email.message.EmailMessage.get_content_type)),
    ("binarize", LblBinarizer())
])

email_len_transform = Pipeline([
    ("convert_to_text", MapTransformer(email_to_text)),
    ("get_email_len", MapTransformer(len)),
])

email_to_word_vector = Pipeline([
    ("email_to_word_count", email_to_word_count),
    ("word_count_to_vector", WordCountsToVector())
])

full_pipeline = FeatureUnion([
    ("email_to_word_vector", email_to_word_vector),
    ("content_type_pipeline", content_type_pipeline),
    ("email_len_transform", email_len_transform)
])

predict_pipeline = Pipeline([
    ("full_pipeline", full_pipeline),
    ("predict", RandomForestClassifier(n_estimators = 5))
])

params = [
    {
        "full_pipeline__email_to_word_vector__email_to_word_count__email_to_wordcount" +
        "__count_word_stem__importance": [3,5],
        "full_pipeline__email_to_word_vector" +
        "__word_count_to_vector__vocabulary_len": [500,1000,1500]
    }
]

grid_search = GridSearchCV(predict_pipeline, params, cv=3, verbose=3, return_train_score=True)
grid_search.fit(X_train, y_train)

Edited: Pipeline calls the fit and transform methods of its transformers, so your transformers should implement those methods. You can implement a custom transformer like the one below and use it instead of the SVC classifier:

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.svm import SVC

class CustomTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        # initialize and fit your SVC here; transform needs a fitted model
        self.svc = SVC()
        self.svc.fit(X, y)
        return self

    def transform(self, X, y=None):
        # expose the classifier's predictions as a single feature column
        return np.asarray(self.svc.predict(X)).reshape(-1, 1)
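With a transformer like that, the question's FeatureUnion can be rebuilt from transformers rather than bare classifiers. A rough sketch along those lines, reusing the extract functions and data from the question (only the logistic regression is tuned here, since the SVC's settings are fixed inside CustomTransformer.fit):

svm_landing = Pipeline([
    ("extractLanding", FunctionTransformer(extractLandingData, validate=False)),
    ("svmLanding", CustomTransformer()),
])
svm_content = Pipeline([
    ("extractContent", FunctionTransformer(extractContentData, validate=False)),
    ("svmContent", CustomTransformer()),
])

full_pipeline = Pipeline([
    ("stagePipeline", FeatureUnion([
        ("svmForLanding", svm_landing),
        ("svmForContent", svm_content),
    ])),
    ("lr", LogisticRegression()),
])

# exposing C as an __init__ parameter of CustomTransformer would make it
# reachable as "stagePipeline__svmForLanding__svmLanding__C" as well
params = {"lr__C": [1, 5, 10]}
grid_search = GridSearchCV(full_pipeline, params, cv=3)
grid_search.fit(X_train, y_train)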