
I have a dataframe with 3 features and 3 classes that I split into X_train, Y_train, X_test, and Y_test. I then run scikit-learn's Pipeline with PCA, StandardScaler, and finally LogisticRegression. I want to calculate the probabilities directly from the LR weights and the raw data, without using predict_proba, but I don't know how, because I'm not sure exactly how the pipeline passes X_test through PCA and StandardScaler into the logistic regression. Is this realistic without being able to use PCA's and StandardScaler's fit methods? Any help would be greatly appreciated!

So far, I have:

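from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pca = PCA(whiten=True)
scaler = StandardScaler()
# The solver name must be passed as a string.
logistic = LogisticRegression(fit_intercept=True, class_weight='balanced', solver='sag', n_jobs=-1, C=1.0, max_iter=200)

# Steps run in this order: PCA, then scaling, then the classifier.
pipe = Pipeline(steps=[('pca', pca), ('scaler', scaler), ('logistic', logistic)])

pipe.fit(X_train, Y_train)

predict_probs = pipe.predict_proba(X_test)

coefficients = pipe.steps[2][1].coef_  # 3 by 30
intercepts = pipe.steps[2][1].intercept_  # 1 by 3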
X_train and X_test must go through exactly the same transformation for the predicted results to be correct. What is the problem with using pipe.predict_proba(X_test)? - Vivek Kumar
If you are worried that pca and scaler will be fit again when you send X_test through the pipe, don't worry. Only transform will be called on them, and predict_proba on logistic. - Vivek Kumar
The problem with pipe.predict_proba(X_test) is that the new test data will be fed in manually in real time, so I just need a way to do the transform, I guess. When PCA and scaler transform X_test, they just use the parameters fitted on X_train, right? - Jeremy
Yes. A pipeline behaves like any other estimator: you fit on the training data and only call predict or transform on the test data. When you call predict_proba on a pipeline, every estimator except the last one only calls transform and passes the data on. The last one calls predict_proba. - Vivek Kumar
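
To make the comments above concrete, here is a minimal sketch of reproducing the pipeline's probabilities by hand from the fitted attributes alone. The synthetic dataset, the default solver, and the helper name manual_predict_proba are assumptions for illustration, not part of the question. The sketch also assumes the multinomial (softmax) formulation, which I believe recent scikit-learn versions use by default for multiclass problems; under a one-vs-rest fit the last step would instead be a per-class sigmoid followed by row normalisation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the asker's data (assumption): 6 features, 3 classes.
X, y = make_classification(n_samples=300, n_features=6, n_informative=4,
                           n_redundant=0, n_classes=3, random_state=0)
X_train, X_test, Y_train, Y_test = train_test_split(X, y, random_state=0)

pipe = Pipeline(steps=[
    ("pca", PCA(whiten=True)),
    ("scaler", StandardScaler()),
    ("logistic", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_train, Y_train)

pca = pipe.named_steps["pca"]
scaler = pipe.named_steps["scaler"]
logistic = pipe.named_steps["logistic"]


def manual_predict_proba(X_new):
    """Hypothetical helper: replay the fitted pipeline using stored parameters only."""
    # 1. PCA: centre with the training mean, project onto the components,
    #    then whiten by dividing by the square root of the explained variance.
    Z = (X_new - pca.mean_) @ pca.components_.T
    Z = Z / np.sqrt(pca.explained_variance_)
    # 2. StandardScaler: subtract the training mean and divide by the training scale.
    Z = (Z - scaler.mean_) / scaler.scale_
    # 3. Logistic regression: linear scores per class.
    logits = Z @ logistic.coef_.T + logistic.intercept_
    # 4. Softmax over classes (assumes a multinomial fit); shift for numerical stability.
    logits = logits - logits.max(axis=1, keepdims=True)
    exp_logits = np.exp(logits)
    return exp_logits / exp_logits.sum(axis=1, keepdims=True)


manual_probs = manual_predict_proba(X_test)
pipeline_probs = pipe.predict_proba(X_test)
print(np.allclose(manual_probs, pipeline_probs))
```

Only pca.mean_, pca.components_, pca.explained_variance_, scaler.mean_, scaler.scale_, coef_ and intercept_ are needed at prediction time, so these arrays could be exported and applied to new rows arriving in real time without calling fit again.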

1 Answer


I had the same question, so thanks to Kumar for the explanation. I had assumed that the pipeline would fit a new transform on x_test. But when I ran a Pipeline composed of StandardScaler and LogisticRegression and compared it with my own function using StandardScaler and LogisticRegression, I found that the Pipeline actually uses the transform fitted on x_train. So don't worry about using Pipeline; it's a really convenient and useful tool for machine learning!
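
A comparison along these lines can be sketched as follows. The synthetic data, the train/test slicing, and the variable names are assumptions for illustration; this is not the exact code I ran.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): 5 features, 3 classes, first 150 rows for training.
X, y = make_classification(n_samples=200, n_features=5, n_informative=3,
                           n_redundant=0, n_classes=3, random_state=1)
x_train, x_test = X[:150], X[150:]
y_train = y[:150]

# Version 1: let Pipeline handle everything.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("logistic", LogisticRegression(max_iter=1000)),
])
pipe.fit(x_train, y_train)

# Version 2: do the same steps by hand, fitting the scaler on x_train only
# and merely transforming x_test.
own_scaler = StandardScaler().fit(x_train)
own_logistic = LogisticRegression(max_iter=1000)
own_logistic.fit(own_scaler.transform(x_train), y_train)
own_probs = own_logistic.predict_proba(own_scaler.transform(x_test))

# The two versions agree, so the pipeline did not refit on x_test.
same_as_pipeline = np.allclose(pipe.predict_proba(x_test), own_probs)

# The scaler inside the pipeline still holds the training mean,
# which differs from the mean a scaler refit on x_test would learn.
pipe_scaler = pipe.named_steps["scaler"]
uses_train_mean = np.allclose(pipe_scaler.mean_, x_train.mean(axis=0))
differs_from_test_mean = not np.allclose(pipe_scaler.mean_, x_test.mean(axis=0))

print(same_as_pipeline, uses_train_mean, differs_from_test_mean)
```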