
I am building a customized ensemble model and would like to do cross validation and grid search in Python using a pipeline. How can I do it?

I have a data set that contains web content. What I want to do is:

  1. Split the content from a single webpage into two parts. The reason for splitting is that the text comes from different places on the page and I want to handle the parts separately.

  2. Train model1 using only the features from part1, and train model2 using only the features from part2.

  3. Assume I get a score S1 from model1 and a score S2 from model2. I train another model, say a logistic regression model, to ensemble these two scores into a final score S.

Throughout this whole process, is there a way I can use an ML pipeline in sklearn to do the cross validation and grid search?


I appreciate Dev's reply below; however, when I tried to do the same thing I ran into new problems. I have the following code:

import pandas as pd
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import FunctionTransformer
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

df = pd.DataFrame(columns=['landingVector', 'contentVector', 'label'])

def extractLandingData(X):
    return X['landingVector']

def extractContentData(X):
    return X['contentVector']



svm_landing = Pipeline([
    ("extractLanding", FunctionTransformer(extractLandingData)),
    ("svmLanding", SVC(random_state=0, class_weight='balanced', kernel='linear', probability=True)),
])
svm_content = Pipeline([
    ("extractContent", FunctionTransformer(extractContentData)),
    ("svmContent", SVC(random_state=0, class_weight='balanced', kernel='linear', probability=True)),
])

stage_pipeline = FeatureUnion([
    ("svmForLanding", svm_landing),
    ("svmForContent", svm_content),
])

full_pipeline = Pipeline([
    ("stagePipeline", stage_pipeline),
    ("lr", LogisticRegression())
])

params = [
    {
        "stagePipeline__svmForLanding__svmLanding__C": [3, 5, 10],
        "lr__C": [1, 5, 10],
        "lr__penalty": ['l1', 'l2']
    }
]

grid_search = GridSearchCV(full_pipeline, params, cv=3, verbose=3, return_train_score=True, n_jobs=-1)
X_train = df[['landingVector', 'contentVector']]
y_train = df['label']
grid_search.fit(X_train, y_train)

Then I got this error message:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input> in <module>
     23 stage_pipeline = FeatureUnion([
     24     ("svmForLanding", svm_landing),
---> 25     ("svmForContent", svm_content),
     26 ])
     27

~/anaconda3/lib/python3.7/site-packages/sklearn/pipeline.py in __init__(self, transformer_list, n_jobs, transformer_weights)
    672         self.n_jobs = n_jobs
    673         self.transformer_weights = transformer_weights
--> 674         self._validate_transformers()
    675
    676     def get_params(self, deep=True):

~/anaconda3/lib/python3.7/site-packages/sklearn/pipeline.py in _validate_transformers(self)
    716                 raise TypeError("All estimators should implement fit and "
    717                                 "transform. '%s' (type %s) doesn't" %
--> 718                                 (t, type(t)))
    719
    720     def _iter(self):

TypeError: All estimators should implement fit and transform. 'Pipeline(memory=None, steps=[('extractLanding', FunctionTransformer(accept_sparse=False, check_inverse=True, func=<function extractLandingData at 0x...>, inv_kw_args=None, inverse_func=None, kw_args=None, pass_y='deprecated', validate=None)), ('svmLanding', SVC(C=1.0, cache_size=200...inear', max_iter=-1, probability=True, random_state=0, shrinking=True, tol=0.001, verbose=False))])' (type <class 'sklearn.pipeline.Pipeline'>) doesn't

take a look at vecstack – Shihab Shahriar Khan
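For reference, a rough sketch of the stacking workflow that comment points at, on synthetic stand-in data (vecstack's stacking helper produces the out-of-fold stage-1 predictions that feed the meta-model):

from vecstack import stacking
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# synthetic stand-in data
X, y = make_classification(n_samples=200, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = [SVC(random_state=0),
          SVC(kernel='linear', random_state=0)]

# out-of-fold predictions of the stage-1 models become the
# features for the stage-2 (meta) model
S_tr, S_te = stacking(models, X_tr, y_tr, X_te,
                      regression=False, n_folds=3)
meta = LogisticRegression().fit(S_tr, y_tr)
print(meta.score(S_te, y_te))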

2 Answers


Say you are dividing your ensemble into two stages:

  1. Stage 1 models, i.e. model1 and model2.

  2. A logistic regression model that is built upon the output of the stage 1 models.

So you can use GridSearchCV in the first stage to find the best parameters. GridSearchCV internally uses cross validation and has a parameter 'cv' for the number of folds, so the best parameters are selected across different folds of the data.

For the stage 2 model, i.e. the logistic regression, you don't really need GridSearchCV, but you can still use 'cross_val_score', which will calculate the score on different subsets of the data.
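A minimal sketch of that two-stage recipe, using synthetic stand-in data (X1 and X2 play the roles of the part-1 and part-2 feature sets):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

# stand-in data: the first 10 columns play the role of the part-1
# features and the rest the part-2 features
X, y = make_classification(n_samples=200, n_features=20, random_state=0)
X1, X2 = X[:, :10], X[:, 10:]

# stage 1: tune each SVM separately; cv sets the number of folds
gs1 = GridSearchCV(SVC(probability=True, random_state=0), {"C": [1, 5, 10]}, cv=3)
gs2 = GridSearchCV(SVC(probability=True, random_state=0), {"C": [1, 5, 10]}, cv=3)
gs1.fit(X1, y)
gs2.fit(X2, y)

# stage 2: stack the two scores S1 and S2 and evaluate the
# logistic regression with cross_val_score
S = np.column_stack([gs1.predict_proba(X1)[:, 1],
                     gs2.predict_proba(X2)[:, 1]])
print(cross_val_score(LogisticRegression(), S, y, cv=3))

Note that scoring the stage-1 models on their own training data leaks information into stage 2; out-of-fold predictions (as stacking libraries produce) avoid that.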


Yes, you can use GridSearchCV or RandomizedSearchCV to find the best hyperparameters for your pipeline model.

  • You can define your model as a combination of pipelines, sequentially or in parallel.
  • Then you can use the final pipeline in GridSearchCV.
  • In the grid_params you can reference a hyperparameter of each inner pipeline by concatenating the step names with a "__" double underscore.

Please look at the following example of a case similar to yours (MapTransformer, CountStemmedWord, WordCountsToVector, and LblBinarizer are custom transformers defined elsewhere). See how the pipelines are chained and how a hyperparameter of a pipeline item is referenced in grid_params.

email_body_to_wordcount = Pipeline([
    ("convert_to_text", MapTransformer(email_to_text)),
    ("strip_html", MapTransformer(strip_html)),
    ("replace_urls", MapTransformer(replace_urls)),
    ("replace_numbers", MapTransformer(replace_numbers)),
    ("replace_non_word_characters", MapTransformer(replace_non_word_characters)),
    ("count_word_stem", CountStemmedWord()),   
], memory="cache")

subject_to_wordcount = Pipeline([
    ("process_text", Pipeline([
        ("get_subject", MapTransformer(get_email_subject)),
        ("replace_numbers", MapTransformer(replace_numbers)),
        ("replace_non_word_characters", MapTransformer(replace_non_word_characters)),
    ], memory="cache")),
    ("count_word_stem", CountStemmedWord(importance=5)),
])

email_to_word_count = FeatureUnion([
    ("email_to_wordcount", email_body_to_wordcount),
    ("subject_to_wordcount", subject_to_wordcount)
])

content_type_pipeline = Pipeline([
    ("get_content_type", MapTransformer(email.message.EmailMessage.get_content_type)),
    ("binarize", LblBinarizer())
])

email_len_transform = Pipeline([
    ("convert_to_text", MapTransformer(email_to_text)),
    ("get_email_len", MapTransformer(len)),
])

email_to_word_vector = Pipeline([
    ("email_to_word_count", email_to_word_count),
    ("word_count_to_vector", WordCountsToVector())
])

full_pipeline = FeatureUnion([
    ("email_to_word_vector", email_to_word_vector),
    ("content_type_pipeline", content_type_pipeline),
    ("email_len_transform", email_len_transform)
])

predict_pipeline = Pipeline([
    ("full_pipeline", full_pipeline),
    ("predict", RandomForestClassifier(n_estimators = 5))
])

params = [
    {
        "full_pipeline__email_to_word_vector__email_to_word_count__email_to_wordcount" +
        "__count_word_stem__importance": [3,5],
        "full_pipeline__email_to_word_vector" +
        "__word_count_to_vector__vocabulary_len": [500,1000,1500]
    }
]

grid_search = GridSearchCV(predict_pipeline, params, cv=3, verbose=3, return_train_score=True)
grid_search.fit(X_train, y_train)

Edited: Pipeline calls the fit and transform methods of its transformers, so your transformers should implement those methods. You can implement a custom transformer like the one below and use it instead of the SVC classifier:

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.svm import SVC

class CustomTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        # initialize and fit your SVC here; transform needs a fitted model
        self.svc = SVC()
        self.svc.fit(X, y)
        return self

    def transform(self, X, y=None):
        # expose the classifier's predictions as a single feature column
        return np.asarray(self.svc.predict(X)).reshape(-1, 1)
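With a transformer like that, the question's FeatureUnion can be rebuilt from transformers rather than bare classifiers. A rough sketch along those lines, reusing the extract functions and data from the question (only the logistic regression is tuned here, since the SVC's settings are fixed inside CustomTransformer.fit):

svm_landing = Pipeline([
    ("extractLanding", FunctionTransformer(extractLandingData, validate=False)),
    ("svmLanding", CustomTransformer()),
])
svm_content = Pipeline([
    ("extractContent", FunctionTransformer(extractContentData, validate=False)),
    ("svmContent", CustomTransformer()),
])

full_pipeline = Pipeline([
    ("stagePipeline", FeatureUnion([
        ("svmForLanding", svm_landing),
        ("svmForContent", svm_content),
    ])),
    ("lr", LogisticRegression()),
])

# exposing C as an __init__ parameter of CustomTransformer would make it
# reachable as "stagePipeline__svmForLanding__svmLanding__C" as well
params = {"lr__C": [1, 5, 10]}
grid_search = GridSearchCV(full_pipeline, params, cv=3)
grid_search.fit(X_train, y_train)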