0
votes

I have a sparse matrix in which each column contains the price of a future. I want to randomly split the data into two sets. I understand that train_test_split in sklearn can randomly split data into two sets, but it cannot satisfy my needs:

  1. The randomly selected data should exclude NaNs.
  2. A different number of cells should be drawn from each column (e.g. if the first column contains 10000 non-NaN cells and the second contains 5000, I need to extract 2000 cells from the first column and 500 from the second as the train set, with the rest as the validation set).

Is there a time-saving way to do this?

1
You should probably just use pd.Series.sample() with different sample sizes for different columns and then concatenate the resulting columns into a dataframe. - pavel
What does a sparse matrix have to do with a pandas dataframe? Seriously consider casting your data into a form that sklearn can easily split. If it can't split it, it probably can't learn from it either. - hpaulj
Thanks for your reply. But I think pd.Series.sample() still cannot exclude NaNs. It doesn't matter what form the data takes; I just need to achieve the goals above without using too many loops. - Serpent_Beginer
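
pavel's suggestion can in fact handle the NaNs if you call dropna() on each column before sampling. Below is a minimal sketch of that idea; the column names and the per-column sizes in the `sizes` dict are made up for illustration:

```python
import numpy as np
import pandas as pd

# Toy frame: two futures with different numbers of non-NaN prices
df = pd.DataFrame({
    "fut_a": [1.0, 2.0, np.nan, 4.0, 5.0, 6.0],
    "fut_b": [np.nan, 20.0, 30.0, np.nan, 50.0, np.nan],
})

# Desired training-set size per column (hypothetical values)
sizes = {"fut_a": 3, "fut_b": 1}

train, valid = {}, {}
for col, n in sizes.items():
    non_nan = df[col].dropna()                        # requirement 1: exclude NaNs
    train[col] = non_nan.sample(n=n, random_state=1)  # requirement 2: per-column size
    valid[col] = non_nan.drop(train[col].index)       # the rest is the validation set

# Reassemble the sampled cells into two frames
train_set = pd.concat(train, axis=1)
valid_set = pd.concat(valid, axis=1)
```

Note that the reassembled frames will again contain NaNs wherever the columns' sampled indices differ; that is expected, since different cells are drawn from each column. The loop runs once per column, not per cell, so it stays cheap even for wide frames.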

1 Answer

0
votes

You can try the following:

# Shuffle the dataset (sms_spam is the DataFrame from the linked source)
data_randomized = sms_spam.sample(frac=1, random_state=1)

# Calculate the split index for an 80:20 ratio
training_test_index = round(len(data_randomized) * 0.8)

# Split into training and test sets
training_set = data_randomized[:training_test_index].reset_index(drop=True)
test_set = data_randomized[training_test_index:].reset_index(drop=True)

print(training_set.shape)
print(test_set.shape)

source : link
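
Since `sms_spam` comes from the linked source and is not defined here, the same shuffle-and-slice split can be sketched self-contained with a toy frame (the column name and data are made up):

```python
import pandas as pd

# Toy stand-in for the sms_spam DataFrame from the linked source
df = pd.DataFrame({"price": range(10)})

# Shuffle all rows, then slice at the 80% mark
shuffled = df.sample(frac=1, random_state=1)
cut = round(len(shuffled) * 0.8)
training_set = shuffled[:cut].reset_index(drop=True)
test_set = shuffled[cut:].reset_index(drop=True)
```

Be aware that this splits whole rows, so on its own it does not give the per-column sample sizes or NaN exclusion the question asks for.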