Generating testing and training datasets with replacement in R

Question

I have mirrored some code to perform an analysis, and everything is working correctly (I believe). However, I am trying to understand a few lines of code related to splitting the data up into 40% testing and 60% training sets.

To my current understanding, the code randomly assigns each row to group 1 or 2. Subsequently, all the rows assigned to 1 are pulled into the training set, and the 2's into the testing set.

Later, I realized that sampling with replacement is not what I wanted for my data analysis, although in this case I am unsure of what is actually being replaced. Currently, I do not believe it is the actual data itself being replaced, but rather the "1" and "2" placeholders. I am looking to understand exactly how these lines of code work. Based on my results, it seems to be accomplishing what I want, but I need to confirm whether or not the data itself is being replaced.

To test the lines in question, I created a dataframe with 10 unique values (1 through 10).

If the data values themselves were being sampled with replacement, I would expect to see some duplicates in "training1" or "testing2". I ran these lines of code 10 times with 10 different set.seed numbers and the data values were never duplicated. To me, this suggests the data itself is not being replaced.

If I set replace = FALSE, I get this error:

Error in sample.int(x, size, replace, prob) : 
  cannot take a sample larger than the population when 'replace = FALSE'

set.seed(8)
test <- sample(2, nrow(df), replace = TRUE, prob = c(.6, .4))

training1 <- df[test==1,]
testing2 <- df[test==2,]
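
To see what replace = TRUE acts on in these lines, here is a minimal sketch, assuming a toy data frame df with a single column holding 1 through 10 (the column name value is made up for illustration). The labels are drawn with replacement from the population c(1, 2); the rows of df are only filtered by those labels, so no row can appear twice or land in both sets.

```r
# Toy data frame; the column name `value` is an assumption for illustration.
df <- data.frame(value = 1:10)

set.seed(8)
# One label (1 or 2) per row, drawn with replacement from the population {1, 2}.
test <- sample(2, nrow(df), replace = TRUE, prob = c(.6, .4))

training1 <- df[test == 1, , drop = FALSE]
testing2  <- df[test == 2, , drop = FALSE]

# Labels repeat (that is the "replacement"), but rows do not.
labels_repeat   <- anyDuplicated(test) > 0
rows_duplicated <- anyDuplicated(c(training1$value, testing2$value)) > 0
rows_covered    <- setequal(c(training1$value, testing2$value), df$value)

print(table(test))
print(c(labels_repeat = labels_repeat,
        rows_duplicated = rows_duplicated,
        rows_covered = rows_covered))

# With replace = FALSE, sample() would need 10 distinct labels from a
# population of only 2, which is why it raises the error shown above.
err <- tryCatch(sample(2, nrow(df), replace = FALSE),
                error = function(e) conditionMessage(e))
print(err)
```

In this sketch the only thing that can repeat is a label, which has to happen as soon as there are more rows than labels; every row of df ends up in exactly one of the two subsets.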

I'd like to split my data 60/40 into training and testing sets, but I am not sure that this is actually happening. I think the prob argument is not doing what I expect: it does not split the data into exactly 60 percent and 40 percent. In the n=10 example, it can result in 7 training and 3 testing, or 6 training and 4 testing. With my actual, larger dataset (n of roughly 2000 or more), it averages out to be pretty close to 60/40 (i.e., 60.3/39.7).
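
The varying split sizes can be reproduced with a short simulation. This is a sketch under the assumption that each row is labelled independently with probability 0.6 of getting a 1, which is what prob = c(.6, .4) does in the lines above; the helper train_fraction is a made-up name for illustration.

```r
# Fraction of rows labelled 1 (training) for one random labelling of n rows.
# `train_fraction` is a hypothetical helper, not part of the original code.
train_fraction <- function(n) {
  labels <- sample(2, n, replace = TRUE, prob = c(.6, .4))
  mean(labels == 1)
}

set.seed(8)
small <- replicate(1000, train_fraction(10))    # like the n = 10 example
large <- replicate(1000, train_fraction(2000))  # like the larger dataset

# Spread of the training fraction across repeated splits.
print(range(small))
print(range(large))
print(c(sd_small = sd(small), sd_large = sd(large)))
```

With 10 rows the training fraction swings widely from run to run, while with 2000 rows it stays close to 0.6, matching what was observed on the real data.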

Mankind_008 · Accepted Answer · 2019-07-29T01:38:48

The way you are sampling is bound to produce a random, and possibly undesired, split size unless the number of observations is huge, in which case the law of large numbers pulls the proportions toward 60/40. To make the split deterministic, decide on the number of observations for the training data and sample that many row indices from nrow(df):

set.seed(8)

# for a 60/40 train/test split
train_indx <- sample(x = seq_len(nrow(df)),
                     size = floor(0.6 * nrow(df)),  # whole number of training rows
                     replace = FALSE)

train_df <- df[train_indx,]
test_df <- df[-train_indx,]
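
As a quick check of this approach, here is a sketch on a toy data frame (the 10-row df and its value column are assumptions for illustration). It confirms that the split sizes are fixed and that no row is shared between the two sets.

```r
# Hypothetical toy data: 10 rows with unique values 1 through 10.
df <- data.frame(value = 1:10)

set.seed(8)
n_train <- floor(0.6 * nrow(df))  # 6 rows for training
train_indx <- sample(seq_len(nrow(df)), size = n_train, replace = FALSE)

train_df <- df[train_indx, , drop = FALSE]
test_df  <- df[-train_indx, , drop = FALSE]

# Sizes are exact, and no value appears in both sets.
print(c(train = nrow(train_df), test = nrow(test_df)))
print(length(intersect(train_df$value, test_df$value)))
```

Because the row indices are drawn without replacement, every seed gives 6 training rows and 4 testing rows here; only which rows land in each set changes.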
