3
votes

I have a data set with more than 2 million entries, which I load into a data frame.

I'm trying to grab a subset of the data. I need around 10000 entries, but I need them to be picked with equal probability with respect to one variable.

This is what my data looks like with str(data):

'data.frame':   2685628 obs. of  3 variables:
 $ category: num  3289 3289 3289 3289 3289 ...
 $ id      : num  8064180 8990447 747922 9725245 9833082 ...
 $ text    : chr  "text1" "text2" "text3" "text4" ...

As you can see, I have 3 variables: category, id and text.

I have tried the following :

> sample_data <- data[sample(nrow(data),10000,replace=FALSE),]

Of course this works, but the sampling probability is not equal across categories. Here is the output of count(sample_data$category):

      x freq
1  3289  707
2  3401  341
3  3482  160
4  3502  243
5  3601 1513
6  3783  716
7  4029  423
8  4166   21
9  4178  894
10 4785   31
11 5108  121
12 5245 2178
13 5637  387
14 5946 1484
15 5977  117
16 6139  664

Update: Here is the output of count(data$category) :

      x   freq
1  3289 198142
2  3401  97864
3  3482  38172
4  3502  59386
5  3601 391800
6  3783 201409
7  4029 111075
8  4166   6749
9  4178 239978
10 4785   6473
11 5108  32083
12 5245 590060
13 5637  98785
14 5946 401625
15 5977  28769
16 6139 183258

But when I try setting the probabilities, I get the following error:

> catCount <- length(unique(data$category))
> probabilities <- rep(c(1/catCount),catCount)
> train_set <- data[sample(nrow(data),10000,prob=probabilities),]
Error in sample.int(x, size, replace, prob) : 
incorrect number of probabilities

I understand that sample is picking randomly among the row numbers, and that prob presumably needs one value per row rather than one per category, but I can't figure out how to associate the category probabilities with the rows.
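
To make the mismatch concrete, sample expects one probability per row it draws from, while I only have one per category (figures taken from the outputs above):

> length(probabilities)
[1] 16
> nrow(data)
[1] 2685628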

Question: How can I sample my data with equal probability over the category variable?

Thanks in advance.

Check out the stratified function from my "splitstackshape" package, perhaps? – A5C1D2H2I1M1N2O1R2T1
That seems interesting. I don't know that package. I'll give it a look. Thanks – eliasah
Are you trying to get an equal number of cases for each unique x? – James
Yes, that's exactly what I want to achieve. – eliasah
Perhaps also interesting: sample_n and sample_frac from the dplyr package, combined with group_by. – talat

1 Answer

5
votes

I guess you could do this with some simple base R operations. Remember, though, that you are using probabilities within sample here, so you won't get the exact count per category with this method, although you can get close enough for a large enough sample.

Here's some example data:

set.seed(123)
data <- data.frame(category = sample(rep(letters[1:10], seq(1000, 10000, by = 1000)), 55000))
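
The categories in this example are deliberately imbalanced, holding between 1000 and 10000 rows each, which you can confirm with:

table(data$category)
#    a    b    c    d    e    f    g    h    i     j 
# 1000 2000 3000 4000 5000 6000 7000 8000 9000 10000 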

Then

probs <- 1/prop.table(table(data$category)) # Calculating relative probabilities
data$probs <- probs[match(data$category, names(probs))] # Matching them to the correct rows
set.seed(123)
train_set <- data[sample(nrow(data), 1000, prob = data$probs), ] # Sampling
table(train_set$category) # Checking frequencies
#  a   b   c   d   e   f   g   h   i   j 
# 94 103  96 107 105  99 100  96 107  93 
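
To see why the draw comes out balanced in expectation, note that these weights give every category the same total probability mass (a quick check, reusing the probs column computed above):

tapply(data$probs, data$category, sum) # Total weight per category
#     a     b     c     d     e     f     g     h     i     j 
# 55000 55000 55000 55000 55000 55000 55000 55000 55000 55000 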

Edit: So here's a possible data.table equivalent

library(data.table)
setDT(data)[, probs := .N, category][, probs := .N/probs]
train_set <- data[sample(.N, 1000, prob = probs)]
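
As with the base R version, the group counts here are only approximately equal; you can verify them the same way:

table(train_set$category) # roughly 100 rows per category, varying with the seed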

Edit #2: Here's a very nice solution using the dplyr package contributed by @Khashaa and @docendodiscimus

The nice thing about this solution is that it returns the exact sample size within each group (here, 1000 rows per category).

library(dplyr)
train_set <- data %>% 
             group_by(category) %>% 
             sample_n(1000)
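
One caveat: sample_n throws an error if any group holds fewer rows than the requested size, so for real, unbalanced data one possible safeguard is to cap the per-group size at the smallest category (a sketch under that assumption):

library(dplyr)
n_per_group <- min(1000, min(table(data$category))) # cap at the smallest category
train_set <- data %>% 
             group_by(category) %>% 
             sample_n(n_per_group)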

Edit #3: It seems that the data.table equivalent of dplyr::sample_n would be

library(data.table)
train_set <- setDT(data)[data[, sample(.I, 1000), category]$V1]

This will also return the exact sample size within each group.
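
Since sample(.I, 1000) draws exactly 1000 row indices within each category, a quick frequency check confirms this:

table(train_set$category)
#    a    b    c    d    e    f    g    h    i    j 
# 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 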