
Here is what I'm trying to do:

Create a new column that assigns a random rank within each subset of rows, with ranks running from 1 to the number of rows in that subset. The grouping variable is the 'stratum' column.

I usually randomly assign rank using nested ifelse statements as shown below. Sometimes this suffices, but lately, I've been dealing with more and more groupings. 40 nested ifelse statements can start to look a little excessive.

Is there a more elegant/quicker/minimal code way to do this using dplyr or data.table, maybe in conjunction with apply, lapply, sapply etc.?

I have tried data.table, but I do not know how to use the sample function with the number of rows in each group.

Reproducible data:

dta <- data.frame(
     uniqueID = c(950513, 951634, 951640, 951641, 951646, 952732, 952895, 952909, 952910, 952911,
                  952912, 952923, 952924, 952925, 952926, 952927, 952928, 952933, 952934, 952935),
     stratum = c("group9", "group6", "group15", "group13", "group9", "group8", "group9",
                 "group15", "group15", "group15", "group15", "group13", "group13",
                 "group1", "group1", "group1", "group1", "group1", "group1", "group1")
)

Here is how I usually assign a random rank, using nested ifelse statements:

dta<- dta[order(dta$stratum),]  
set.seed(7265)                                                                                                                 

dta$rank <- ifelse(dta$stratum== "group1",sample(1:nrow(dta[dta$stratum== "group1",])),
               ifelse(dta$stratum=="group6",sample(1:nrow(dta[dta$stratum== "group6",])),
                      ifelse(dta$stratum=="group8",sample(1:nrow(dta[dta$stratum== "group8",])),
                             ifelse(dta$stratum=="group9",sample(1:nrow(dta[dta$stratum== "group9",])),
                                    ifelse(dta$stratum=="group13",sample(1:nrow(dta[dta$stratum== "group13",])),
                                           ifelse(dta$stratum=="group15",sample(1:nrow(dta[dta$stratum== "group15",])),
                                                  0))))))

3 Answers

2
votes

Using dplyr, you can do

library(dplyr)
dta %>% 
    group_by(stratum) %>% 
    mutate(rank=sample.int(n()))

group_by() lets mutate() operate on one group of rows at a time, and dplyr's built-in n() returns the number of rows in the current group. I used sample.int(n()) rather than sample(1:n()) because it is slightly more direct; both return a random permutation of 1 to n().

In general, nested ifelse() calls are better written with case_when() in dplyr, but this task is a per-group operation, so group_by() is the better fit.
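To see that sample.int(n) and sample(1:n) are interchangeable here, a quick check in base R can help. This is only a sketch; the seed and the value of n are arbitrary choices for illustration.

```r
# With the same seed, sample(1:n) and sample.int(n) draw the same permutation,
# because sample() delegates to sample.int() internally.
n <- 7

set.seed(7265)
a <- sample(1:n)

set.seed(7265)
b <- sample.int(n)

identical(a, b)  # TRUE
```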

2
votes

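Consider base R's by, designed to split data frames by factor(s). This assumes dta is already sorted by stratum, as in the question, because unlist() returns the ranks in group order: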

dta$rank <- unlist(by(dta, dta$stratum, FUN=function(df) sample(1:nrow(df))))

#    uniqueID stratum rank
# 14   952925  group1    6
# 15   952926  group1    2
# 16   952927  group1    1
# 17   952928  group1    3
# 18   952933  group1    5
# 19   952934  group1    7
# 20   952935  group1    4
# 4    951641 group13    2
# 12   952923 group13    1
# 13   952924 group13    3
# 3    951640 group15    1
# 8    952909 group15    3
# 9    952910 group15    5
# 10   952911 group15    2
# 11   952912 group15    4
# 2    951634  group6    1
# 6    952732  group8    1
# 1    950513  group9    2
# 5    951646  group9    1
# 7    952895  group9    3
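If the data frame is not sorted by stratum, one order-independent base R option is ave(), which applies a function within groups and returns the results in the original row order. A minimal sketch follows; the data is a shortened copy of the question's dta, left unsorted on purpose so the block runs on its own.

```r
# Shortened copy of the question's data, deliberately left unsorted
dta <- data.frame(
  uniqueID = c(950513, 951634, 951640, 951641, 951646, 952732, 952895, 952909),
  stratum  = c("group9", "group6", "group15", "group13",
               "group9", "group8", "group9", "group15")
)

set.seed(7265)

# ave() splits the row indices by stratum, draws a permutation of 1..n
# within each group, and writes each result back to the rows it came from
dta$rank <- ave(seq_len(nrow(dta)), dta$stratum,
                FUN = function(i) sample.int(length(i)))

dta
```

Because ave() works on row positions, no prior call to order() is needed.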
1
votes

Solution using data.table:

library(data.table)
setDT(dta)[, rank := sample(1:.N), stratum]

 #     uniqueID stratum rank
 #  1:   952925  group1    4
 #  2:   952926  group1    2
 #  3:   952927  group1    1
 #  4:   952928  group1    6
 #  5:   952933  group1    7
 #  6:   952934  group1    3
 #  7:   952935  group1    5
 #  8:   951641 group13    2
 #  9:   952923 group13    1
 # 10:   952924 group13    3
 # ...

Explanation:

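  1. Convert the object to a data.table by reference (setDT()).
  2. For each group (the third argument inside [ ], here stratum, is the by argument), draw a random permutation of 1 to .N, where .N is the number of rows in that group, and assign it to rank with :=.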
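
Whichever approach is used, the result can be checked: within each stratum, the ranks should be exactly the integers from 1 to the group size. A small sketch in base R follows. Here check_ranks is a hypothetical helper name, and the toy data frame stands in for dta after the rank column has been added.

```r
# Hypothetical helper: TRUE if, within every group, rank is a permutation of 1..n
check_ranks <- function(rank, group) {
  ok <- tapply(rank, group, function(r) {
    setequal(r, seq_along(r)) && !anyDuplicated(r)
  })
  all(ok)
}

# Toy stand-in for a data frame that already has a rank column
toy <- data.frame(
  stratum = c("a", "a", "a", "b", "b", "c"),
  rank    = c(2, 3, 1, 1, 2, 1)
)
check_ranks(toy$rank, toy$stratum)   # TRUE

toy$rank[1] <- 3                     # introduce a duplicate rank in group "a"
check_ranks(toy$rank, toy$stratum)   # FALSE
```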