Beware of `sample` for splitting if you are looking for reproducible results. If your data changes even slightly, the split will vary even if you use `set.seed`. For example, imagine the sorted list of IDs in your data is all the numbers between 1 and 10. If you dropped just one observation, say 4, sampling by position would yield different results, because 5 to 10 would all have moved places.
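A minimal sketch of the failure mode (base R only; the IDs and the seed are made up for illustration):

set.seed(1)
sample(1:10, 5)              # draw 5 IDs from the full list
set.seed(1)
sample(setdiff(1:10, 4), 5)  # same seed, ID 4 dropped: different IDs come out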
An alternative method is to use a hash function to map IDs to pseudo-random numbers and then sample on the mod of these numbers. This split is more stable because assignment is now determined by the hash of each observation, not by its relative position in the data.
For example:
require(openssl) # for md5
require(data.table) # for the demo data
set.seed(1) # this won't help `sample`
population <- as.character(1e5:(1e6-1)) # some made up ID names
N <- 1e4 # sample size
sample1 <- data.table(id = sort(sample(population, N))) # randomly sample N ids
sample2 <- sample1[-sample(N, 1)] # randomly drop one observation from sample1
sample1
sample2
nrow(merge(sample1, sample2)) # how many IDs the two samples share
[1] 9999
test <- sample(N - 1, N/2, replace = FALSE) # pick N/2 random row positions
test1 <- sample1[test, .(id)] # apply the "same" positional split to both samples
test2 <- sample2[test, .(id)]
nrow(test1)
[1] 5000
nrow(merge(test1, test2))
[1] 2653
Only about half of the "identical" test positions still point to the same IDs: every row at or after the dropped observation shifted by one, so the two test sets diverge badly.
md5_bit_mod <- function(x, m = 2L) {
  # hash each ID, keep the first hex digit, and reduce it modulo m
  as.integer(as.hexmode(substr(openssl::md5(x), 1, 1)) %% m)
}
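As a quick sanity check (hypothetical, reusing the population vector defined above), the assignment is deterministic per ID and roughly balanced:

md5_bit_mod(c("100000", "100001")) # the same input always lands in the same bucket
table(md5_bit_mod(head(population, 1000))) # roughly half 0s, half 1s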
test1a <- sample1[md5_bit_mod(id) == 0L, .(id)] # selection depends only on the ID itself
test2a <- sample2[md5_bit_mod(id) == 0L, .(id)]
nrow(merge(test1a, test2a)) # the overlap is now the entire test set
[1] 5057
nrow(test1a)
[1] 5057
The sample size is not exactly 5000 because assignment is probabilistic, but that shouldn't be a problem in large samples, thanks to the law of large numbers.
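Fractions other than 50/50 work the same way through the m argument of md5_bit_mod; here is a sketch of a roughly 75/25 split (the modulus 4 and the names train_b/test_b are just for illustration):

test_b <- sample1[md5_bit_mod(id, 4L) == 0L]  # ~25% of IDs hash to residue 0
train_b <- sample1[md5_bit_mod(id, 4L) != 0L] # the remaining ~75%
nrow(test_b) / nrow(sample1)                  # close to, but not exactly, 0.25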
See also: http://blog.richardweiss.org/2016/12/25/hash-splits.html
and https://crypto.stackexchange.com/questions/20742/statistical-properties-of-hash-functions-when-calculating-modulo
`x` can be the index (row/col nos., say) of your `data`; `size` can be `0.75*nrow(data)`. Try `sample(1:10, 4, replace = FALSE, prob = NULL)` to see what it does. - harkmug