I'm trying to fit a logistic regression. I want to split the training and testing data by account (a variable that plays no role in the fitting), and each account can have lots of variables. For example, 80% of the accounts should go to training and 20% of the accounts to testing.

I've tried the following, but this code just gives me a random 80% training / 20% testing split of the rows. The training data then contains some account, and the testing data contains that exact same account, just with different variables. That's not what I want.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)
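
To make the problem concrete, here is a minimal sketch that checks which accounts land on both sides of a row-level split. It assumes X is a pandas DataFrame with an `account` column; the data, the `feature` column, and the name `shared_accounts` are made up for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: 4 accounts with 5 rows each (illustrative only).
X = pd.DataFrame({
    "account": [a for a in "ABCD" for _ in range(5)],
    "feature": range(20),
})
y = pd.Series([i % 2 for i in range(20)])

# Row-level random split, as in the call above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0
)

# Accounts that appear in both the training and the testing rows.
shared_accounts = set(X_train["account"]) & set(X_test["account"])
print(sorted(shared_accounts))
```

A non-empty result means rows of the same account ended up in both sets, which is exactly the overlap I want to avoid.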

Please advise. Thank you!

Can I modify the code this way? X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0, stratify=account) – vicky
"each account can have lots of variables" - what does this mean? – Supratim Haldar

1 Answer

What about this:

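import numpy as np

def group_train_test_split(X, y, test_size, random_state, stratify):
    # `stratify` is the name of the grouping column (here 'account'); all rows
    # of a given group end up on the same side of the split.
    X = X.copy()
    X['_target'] = y

    # Index by group so that .loc can select every row of a group at once.
    # Note that this removes the grouping column from the returned features.
    X = X.set_index(stratify)

    # Shuffle the unique group labels with a local, seeded generator
    # instead of reseeding NumPy's global random state.
    rng = np.random.RandomState(random_state)
    index_values = rng.permutation(X.index.unique().to_numpy())

    # Number of groups (not rows) assigned to the test set.
    cut = int(round(len(index_values) * test_size))

    X_test, X_train = X.loc[index_values[:cut]], X.loc[index_values[cut:]]

    return X_test['_target'], X_train['_target'], X_test.drop('_target', axis=1), X_train.drop('_target', axis=1)

# The return order is test first, then train, with targets before features.
y_test, y_train, X_test, X_train = group_train_test_split(X=X, y=y, test_size=0.2, random_state=41, stratify='account')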

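This puts 20% of the accounts in the test data and the rest in the training data, so no account appears in both. Note that the 20% refers to accounts, not rows: if accounts differ a lot in size, the share of rows in the test data can differ from 20%.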
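
As an alternative to a hand-written function, scikit-learn's `GroupShuffleSplit` performs the same kind of group-wise split. The following is only a sketch under assumptions: the data is made up, and it assumes X is a pandas DataFrame with an `account` column, which is dropped afterwards because it is not used as a feature.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Made-up data: 10 accounts with 3 rows each (illustrative only).
X = pd.DataFrame({
    "account": [f"acc{i}" for i in range(10) for _ in range(3)],
    "feature": range(30),
})
y = pd.Series([i % 2 for i in range(30)])

# test_size is the fraction of groups (accounts), not of rows.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=X["account"]))

# Select rows by position and drop the grouping column from the features.
X_train = X.iloc[train_idx].drop(columns="account")
X_test = X.iloc[test_idx].drop(columns="account")
y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]

# Sanity check: the two sets of accounts should not overlap.
train_accounts = set(X["account"].iloc[train_idx])
test_accounts = set(X["account"].iloc[test_idx])
print(len(train_accounts), len(test_accounts), train_accounts & test_accounts)
```

The intersection printed at the end should be empty. Unlike the function above, this keeps the conventional train-then-test ordering and leaves the original index untouched.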