Let's say I want to use a LinearSVC to perform k-fold cross-validation on a dataset. How should I standardize the data?
The best practice I have read is to fit the standardization (the scaler's mean and variance) on the training data only, then apply that fitted transform to the testing data.
When one uses a simple train_test_split(), this is easy as we can just do:
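from sklearn import svm
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# X and y are the feature matrix and labels, assumed already loaded
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)
clf = svm.LinearSVC()
scaler = StandardScaler()
# fit the scaler on the training split only, then reuse it on the test split
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
clf.fit(X_train, y_train)
predicted = clf.predict(X_test)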
How would one go about standardizing data while doing k-fold cross-validation? The problem is that every data point lands in a test fold once and in a training fold the rest of the time, so you cannot standardize everything up front before calling cross_val_score() without the test folds influencing the scaler. Wouldn't you need a separate standardization fit for each fold?
The docs do not mention standardization happening internally within cross_val_score(). Am I SOL?
EDIT: This post is super helpful: Python - What is exactly sklearn.pipeline.Pipeline?
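For anyone landing here later, below is a minimal sketch of how I understand the Pipeline approach applies to my case. As far as I can tell, cross_val_score() clones and refits the whole pipeline on each training fold, so the scaler is fit per fold and never sees the held-out fold. The synthetic data from make_classification, the 5 folds, and the random_state and max_iter values are placeholder choices of mine, not anything from my real dataset. The manual loop is only there to check that the pipeline does the same thing as fitting a scaler per fold by hand.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in data (an assumption; substitute your own X and y).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Fixed folds so both approaches below see identical splits.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Scaler and classifier bundled together: cross_val_score refits the
# whole pipeline on each training fold, then scores on the held-out fold.
pipe = make_pipeline(StandardScaler(), LinearSVC(random_state=0, max_iter=10000))
pipeline_scores = cross_val_score(pipe, X, y, cv=cv)

# The same procedure written out by hand, for comparison.
manual_scores = []
for train_idx, test_idx in cv.split(X, y):
    # Fit the scaler on this fold's training rows only.
    scaler = StandardScaler().fit(X[train_idx])
    clf = LinearSVC(random_state=0, max_iter=10000)
    clf.fit(scaler.transform(X[train_idx]), y[train_idx])
    # Apply the already-fitted scaler to the held-out rows and score.
    manual_scores.append(clf.score(scaler.transform(X[test_idx]), y[test_idx]))
manual_scores = np.array(manual_scores)

print("pipeline mean accuracy:", pipeline_scores.mean())
print("manual mean accuracy:  ", manual_scores.mean())
```

With the same folds and seeds, the two sets of per-fold scores should agree, which suggests the pipeline is a shorter way to get a separate standardization for each fold.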