
I am using SKlearn KFold as follows:

        kf = KFold(10000, n_folds=5, shuffle=True, random_state=88)

However, I want to exclude certain indices from the training folds (only). How can this be achieved? Thanks.

I wonder if this can be achieved by using sklearn.cross_validation.PredefinedSplit?


Update: The KFold instance will be used with XGBoost for the folds parameter of xgb.cv. The Python API here states that folds should be "a KFold or StratifiedKFold instance".

However, I will try generating the KFolds as above, iterating over the train fold indices, modifying them, and then defining a custom_cv by hand like this:

custom_cv = zip(train_indices, test_indices)
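A minimal sketch of that plan, using NumPy to drop the unwanted indices from each training fold. The `exclude` array and the fold contents below are made up for illustration; in practice `kf` would be the KFold instance above.

```python
import numpy as np

# Hypothetical indices to exclude from every training fold.
exclude = np.array([3, 7, 42])

# Stand-in for the KFold instance: an iterable of (train, test) index pairs.
kf = [
    (np.array([0, 1, 2, 3, 4, 5, 6, 7]), np.array([8, 9])),
    (np.array([2, 3, 42, 8, 9]), np.array([0, 1])),
]

# Drop the excluded indices from the training side only; test folds are untouched.
custom_cv = [
    (train[~np.isin(train, exclude)], test)
    for train, test in kf
]
```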
What do you mean by "return them to the KFold object"? What are you trying to accomplish? – juanpa.arrivillaga

The KFold will be given to XGBoost for xgb.cv. I need to remove certain indices from the training folds before passing the KFold instance to xgb. – Chris Parry

I'm still not sure what you mean by "return them to the KFold object." – juanpa.arrivillaga

Using KFold, I split my training data into train and valid. I am going to pass the KFold instance to XGBoost, which will use it during its cross-validation. However, before I do that, I want to exclude some specific indices from the training data only (not the valid data). An alternative way to do it is to use fpreproc, but that involves modifying a DMatrix object. Hope that clarifies. If there is a better way to exclude certain indices from the KFold split, please let me know. I will modify the question to clarify. – Chris Parry

I'm not familiar with XGBoost, but if you do something like kf_list = list(kf), it will return a list of tuples that is iterable in the same way as the KFold object, and you can remove the indices you want from the tuples in the list. – juanpa.arrivillaga

1 Answer


If you want to remove indices from the training set, but it is ok if they are in the testing set, then this approach will work:

kf_list = list(kf)

This will return a list of tuples that can be iterated over in the same way as the KFold instance. You can then simply modify the indices as you see fit, and your KFold instance will stay untouched. You can think of a KFold object as an array of integers, representing the indices, and methods that let you generate the folds on the fly.
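As a sketch of that (the fold contents and the `to_drop` set below are made up for illustration): since each element of the list is an immutable tuple, you rebuild the pairs with a filtered training array rather than modifying them in place.

```python
import numpy as np

# Stand-in for list(kf): each element is a (train, test) tuple of index arrays.
kf_list = [
    (np.array([0, 1, 2, 3]), np.array([4, 5])),
    (np.array([2, 3, 4, 5]), np.array([0, 1])),
]

to_drop = {1, 4}  # hypothetical indices to exclude from training only

# Tuples are immutable, so build new pairs with the filtered train arrays.
kf_list = [
    (np.array([i for i in train if i not in to_drop]), test)
    for train, test in kf_list
]
```

The resulting list can then be passed wherever an iterable of (train, test) index pairs is accepted.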

Here's the source code (which is pretty straightforward) for the part that implements the iterator protocol:

https://github.com/scikit-learn/scikit-learn/blob/51a765a/sklearn/cross_validation.py#L254

def _iter_test_indices(self):
    n = self.n
    n_folds = self.n_folds
    fold_sizes = (n // n_folds) * np.ones(n_folds, dtype=np.int)
    fold_sizes[:n % n_folds] += 1
    current = 0
    for fold_size in fold_sizes:
        start, stop = current, current + fold_size
        yield self.idxs[start:stop]
        current = stop