Say I have a dataset and I want to assign observations to different groups, the size of groups determined by the data. For example, suppose that this is the data:
sysuse census, clear
keep state region pop
order state pop region
decode region, gen(reg)
replace reg="NCntrl" if reg=="N Cntrl"
drop region
*Create global with regions
global region NE NCntrl South West
*Count the number in each region
bys reg (pop): gen reg_N=_N
tab reg
There are four reg
groups, all of different sizes. Now, I want to randomly assign observations to the four groups. This is accomplished below by generating a random number and then assigning observations to one of the groups based on the random number.
*Generate random number
set seed 1
gen random = runiform()
sort random
*Assign observations to number based on random sorting
egen reg_rand = seq(), from(1) to (4)
*Map number to region
gen reg_new = ""
global count 1
foreach i in $region {
replace reg_new = "`i'" if reg_rand==$count
global count = $count + 1
}
bys reg_new: gen reg_new_N = _N
tab reg_new
This is not what I want, though. Instead of using the seq()
command, which creates groups of equal sizes (assuming N divided by number of groups is a whole number), I would like to randomly assign based on the size of the original groups. In this case, that is equivalent to reg_N
. For example, there would be 12 observations that have a reg_new
value of NCntrl
.
I might have one solution similar to https://stats.idre.ucla.edu/stata/faq/how-can-i-randomly-assign-observations-to-groups-in-stata/. The idea would be to save the results of tab reg
into a macro or matrix, and then use a loop and replace to cycle through the observations, which are sorted by a random number. Assume that there are many, many more groups than the four in this toy example. Is there a more reasonable way to accomplish this?