I am building a custom ensemble model and would like to run cross-validation and grid search on it in Python using a pipeline. How can I do this?
I have a data set that contains web content. What I want to do is:
Split the content of each web page into two parts. The text comes from different places on the page, and I want to handle the two parts separately.
Train model1 using only the features from part 1, and train model2 using only the features from part 2.
Take the score S1 from model1 and the score S2 from model2, and train another model, say a logistic regression, to combine these two scores into a final score S.
Throughout this whole process, is there a way to use an sklearn Pipeline to do the cross-validation and grid search?
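To make the three steps concrete, here is a minimal sketch of them written by hand, without any pipeline. The data is synthetic, the names `X_part1`, `X_part2`, `s1`, and `s2` are made up for illustration, and using out-of-fold SVM decision scores as S1 and S2 is an assumption rather than a requirement.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption): 200 pages, 5 features per part.
rng = np.random.default_rng(0)
X_part1 = rng.normal(size=(200, 5))  # features from part 1 of each page
X_part2 = rng.normal(size=(200, 5))  # features from part 2 of each page
y = (X_part1[:, 0] + X_part2[:, 0] > 0).astype(int)

# Step 2: one model per part.
model1 = SVC(kernel="linear", class_weight="balanced", random_state=0)
model2 = SVC(kernel="linear", class_weight="balanced", random_state=0)

# Out-of-fold scores S1 and S2, so the ensemble model is not trained on
# scores that the base models produced for their own training rows.
s1 = cross_val_predict(model1, X_part1, y, cv=3, method="decision_function")
s2 = cross_val_predict(model2, X_part2, y, cv=3, method="decision_function")

# Step 3: logistic regression combines the two scores into a final score S.
scores = np.column_stack([s1, s2])
ensemble = LogisticRegression().fit(scores, y)
final_score = ensemble.predict_proba(scores)[:, 1]
```

This runs, but the hyperparameters of all three models are fixed and nothing is tuned jointly, which is the part I would like a pipeline and grid search to handle.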
I appreciate Dev's reply below. However, when I tried to do the same thing, I ran into new problems. My code is as follows:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.svm import SVC

# One row per web page: two feature vectors plus the label.
df = pd.DataFrame(columns=['landingVector', 'contentVector', 'label'])

def extractLandingData(X):
    return X['landingVector']

def extractContentData(X):
    return X['contentVector']

# One SVM per part of the page.
svm_landing = Pipeline([
    ("extractLanding", FunctionTransformer(extractLandingData)),
    ("svmLanding", SVC(random_state=0, class_weight='balanced', kernel='linear', probability=True)),
])

svm_content = Pipeline([
    ("extractContent", FunctionTransformer(extractContentData)),
    ("svmContent", SVC(random_state=0, class_weight='balanced', kernel='linear', probability=True)),
])

# Combine the two SVM outputs, then feed them to the logistic regression.
stage_pipeline = FeatureUnion([
    ("svmForLanding", svm_landing),
    ("svmForContent", svm_content),
])

full_pipeline = Pipeline([
    ("stagePipeline", stage_pipeline),
    ("lr", LogisticRegression())
])

params = [
    {
        "stagePipeline__svmForLanding__svmLanding__C": [3, 5, 10],
        "full_pipeline__lr__C": [1, 5, 10],
        "full_pipeline__lr__penalty": ['l1', 'l2']
    }
]

grid_search = GridSearchCV(full_pipeline, params, cv=3, verbose=3, return_train_score=True, n_jobs=-1)

X_train = df[['landingVector', 'contentVector']]
y_train = df['label']
grid_search.fit(X_train, y_train)
Then I got the following error message:
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
in
     23 stage_pipeline = FeatureUnion([
     24     ("svmForLanding", svm_landing),
---> 25     ("svmForContent", svm_content),
     26 ])
     27

~/anaconda3/lib/python3.7/site-packages/sklearn/pipeline.py in __init__(self, transformer_list, n_jobs, transformer_weights)
    672         self.n_jobs = n_jobs
    673         self.transformer_weights = transformer_weights
--> 674         self._validate_transformers()
    675
    676     def get_params(self, deep=True):

~/anaconda3/lib/python3.7/site-packages/sklearn/pipeline.py in _validate_transformers(self)
    716                 raise TypeError("All estimators should implement fit and "
    717                                 "transform. '%s' (type %s) doesn't" %
--> 718                                 (t, type(t)))
    719
    720     def _iter(self):

TypeError: All estimators should implement fit and transform. 'Pipeline(memory=None, steps=[('extractLanding', FunctionTransformer(accept_sparse=False, check_inverse=True, func=, inv_kw_args=None, inverse_func=None, kw_args=None, pass_y='deprecated', validate=None)), ('svmLanding', SVC(C=1.0, cache_size=200...inear', max_iter=-1, probability=True, random_state=0, shrinking=True, tol=0.001, verbose=False))])' (type ) doesn't
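The message says that every member of a FeatureUnion must offer both fit and transform, while a Pipeline whose last step is an SVC only offers predict-style methods. One possible workaround, sketched here under assumptions, is to wrap each classifier in a small transformer whose transform returns the classifier's scores. `ScoreTransformer` is a hypothetical helper, not part of sklearn; the data is synthetic; the extractor functions are assumed to stack the per-row vectors into a 2-D array; and the grid keys for the final step are written as `lr__C`, relative to the estimator passed to GridSearchCV.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.svm import SVC


class ScoreTransformer(TransformerMixin, BaseEstimator):
    """Hypothetical helper (not part of sklearn): exposes a classifier's
    decision scores through transform() so it can sit in a FeatureUnion."""

    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X, y=None):
        self.estimator_ = clone(self.estimator).fit(X, y)
        return self

    def transform(self, X):
        # One score column per wrapped model (binary labels assumed).
        return self.estimator_.decision_function(X).reshape(-1, 1)


def extract_landing(X):
    # Assumption: each cell holds one fixed-length vector.
    return np.vstack(list(X["landingVector"]))


def extract_content(X):
    return np.vstack(list(X["contentVector"]))


svm = SVC(random_state=0, class_weight="balanced", kernel="linear")
svm_landing = Pipeline([
    ("extractLanding", FunctionTransformer(extract_landing)),
    ("svmLanding", ScoreTransformer(svm)),
])
svm_content = Pipeline([
    ("extractContent", FunctionTransformer(extract_content)),
    ("svmContent", ScoreTransformer(svm)),
])
full_pipeline = Pipeline([
    ("stagePipeline", FeatureUnion([
        ("svmForLanding", svm_landing),
        ("svmForContent", svm_content),
    ])),
    ("lr", LogisticRegression()),
])

# Keys are relative to full_pipeline; the wrapped SVC is reached via "estimator".
params = {
    "stagePipeline__svmForLanding__svmLanding__estimator__C": [1, 10],
    "lr__C": [1, 10],
}

# Synthetic stand-in data (an assumption): 60 pages, 4-dimensional vectors.
rng = np.random.default_rng(0)
y = np.arange(60) % 2
df = pd.DataFrame({
    "landingVector": list(rng.normal(size=(60, 4)) + y[:, None]),
    "contentVector": list(rng.normal(size=(60, 4)) - y[:, None]),
    "label": y,
})
X_train = df[["landingVector", "contentVector"]]
y_train = df["label"]

grid_search = GridSearchCV(full_pipeline, params, cv=3)
grid_search.fit(X_train, y_train)
```

In this sketch the logistic regression is fitted on scores that the SVMs produce for their own training rows, so it may lean too heavily on them; within each grid-search fold the held-out part is still scored on unseen data.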