0
votes

I have a sparse matrix in which each column contains the price of a future. I want to randomly split the data into two sets. I understand that train_test_split in sklearn can randomly split data into two sets, but it cannot satisfy my needs:

  1. The randomly selected data should exclude NaNs.
  2. A different number of cells should be drawn from each column (e.g. if the first column contains 10000 non-NaN cells and the second contains 5000, I need to extract 2000 cells from the first column and 500 from the second as the train set, with the rest as the validation set).

Is there a time-saving way to do this?

1
You should probably just use pd.Series.sample() with different sample sizes for different columns and then concatenate the resulting columns into a dataframe. - pavel
What does a sparse matrix have to do with a pandas dataframe? Seriously consider casting your data into a form that sklearn can easily split. If it can't split it, it probably can't learn from it either. - hpaulj
Thanks for your reply. But I think pd.Series.sample() still cannot exclude NaNs. It doesn't matter what form the data takes; I just need to achieve the goals above without using too many loops. - Serpent_Beginer
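
pavel's suggestion can in fact handle the NaNs if you call dropna() on each column before sampling. Below is a minimal sketch of that idea; the column names and the per-column sizes in the `sizes` dict are made up for illustration:

```python
import numpy as np
import pandas as pd

# Toy frame: two futures with different numbers of non-NaN prices
df = pd.DataFrame({
    "fut_a": [1.0, 2.0, np.nan, 4.0, 5.0, 6.0],
    "fut_b": [np.nan, 20.0, 30.0, np.nan, 50.0, np.nan],
})

# Desired training-set size per column (hypothetical values)
sizes = {"fut_a": 3, "fut_b": 1}

train, valid = {}, {}
for col, n in sizes.items():
    non_nan = df[col].dropna()                        # requirement 1: exclude NaNs
    train[col] = non_nan.sample(n=n, random_state=1)  # requirement 2: per-column size
    valid[col] = non_nan.drop(train[col].index)       # the rest is the validation set

# Reassemble the sampled cells into two frames
train_set = pd.concat(train, axis=1)
valid_set = pd.concat(valid, axis=1)
```

Note that the reassembled frames will again contain NaNs wherever the columns' sampled indices differ; that is expected, since different cells are drawn from each column. The loop runs once per column, not per cell, so it stays cheap even for wide frames.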

1 Answer

0
votes

You can try the following:

# Shuffle the dataset (sms_spam is the DataFrame from the linked source)
data_randomized = sms_spam.sample(frac=1, random_state=1)

# Calculate the split index for an 80:20 ratio
training_test_index = round(len(data_randomized) * 0.8)

# Split into training and test sets
training_set = data_randomized[:training_test_index].reset_index(drop=True)
test_set = data_randomized[training_test_index:].reset_index(drop=True)

print(training_set.shape)
print(test_set.shape)

source : link
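
Since `sms_spam` comes from the linked source and is not defined here, the same shuffle-and-slice split can be sketched self-contained with a toy frame (the column name and data are made up):

```python
import pandas as pd

# Toy stand-in for the sms_spam DataFrame from the linked source
df = pd.DataFrame({"price": range(10)})

# Shuffle all rows, then slice at the 80% mark
shuffled = df.sample(frac=1, random_state=1)
cut = round(len(shuffled) * 0.8)
training_set = shuffled[:cut].reset_index(drop=True)
test_set = shuffled[cut:].reset_index(drop=True)
```

Be aware that this splits whole rows, so on its own it does not give the per-column sample sizes or NaN exclusion the question asks for.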