3
votes

How to divide a given dataset into train and test sets along with their correct labels?

The sklearn library provides an implementation of this:

# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import train_test_split

train, test = train_test_split(df, test_size = 0.2)

where df is the original dataset, e.g. a list of strings.

The problem is that it doesn't take the targets/labels along with the dataset, so we cannot track which label belongs to which data point.

Is there any way to bind data points and their labels and then split the data sets into train and test?

1
What is df in your snippet above? - Ami Tavory
df is the original dataset or corpus - mach

1 Answer

4
votes

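train_test_split (in sklearn.cross_validation in older releases, in sklearn.model_selection since scikit-learn 0.18) takes a variable number of arrays and splits all of them using the same indices. From its documentation: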

*arrays : sequence of arrays or scipy.sparse matrices with same shape[0]

Returns:
splitting : list of arrays, length=2 * len(arrays) List containing train-test split of input array.

So you can simply pass the labels list as an additional argument:

from sklearn import model_selection  # cross_validation was removed in scikit-learn 0.20

df = ['the', 'quick', 'brown', 'fox']
labels = [0, 1, 0, 0]

>>> model_selection.train_test_split(df, labels, test_size=0.2)
[['quick', 'fox', 'the'], ['brown'], [1, 0, 0], [0]]
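
In practice it is usually more convenient to unpack the four returned lists directly. Below is a minimal sketch, assuming a recent scikit-learn where the function lives in sklearn.model_selection. The names data_train, data_test, labels_train and labels_test, the extra sample 'jumps', and the random_state value are illustrative choices, not part of the original question.

```python
from sklearn.model_selection import train_test_split

# Toy corpus and labels (hypothetical example data)
data = ['the', 'quick', 'brown', 'fox', 'jumps']
labels = [0, 1, 0, 0, 1]

# Both sequences are split with the same shuffled indices, so
# data_train[i] still corresponds to labels_train[i].
# random_state fixes the shuffle so the split is reproducible.
data_train, data_test, labels_train, labels_test = train_test_split(
    data, labels, test_size=0.2, random_state=42
)

# Sanity check: every data point kept its original label.
lookup = dict(zip(data, labels))
for point, label in zip(data_train, labels_train):
    assert lookup[point] == label
for point, label in zip(data_test, labels_test):
    assert lookup[point] == label
```

The order of the return values follows the order of the inputs: train and test for the first array, then train and test for the second.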