I have a large dataset, and since it is large I have to either split it or load one variable at a time. I have loaded the unique identifier id, and I need to select 50 observations at random, 100 times. I searched and found sample and runiform() to generate a random sample. My problem is that I need to generate 100 samples of 50 observations each; since I have to sample from the entire dataset, which is large, I can only keep one variable in memory, so I need to save the result of the sampling 100 times. I know I could use a forvalues loop, but it is not efficient and takes a long time for even 10 iterations. Is there a faster way to generate more than one sample? Here's my code:
forvalues i = 1/100 {
    use generated/data1.dta, clear
    sample 50, count
    merge 1:m id using generated/10m.dta
    keep if _merge == 1 | _merge == 3
    drop _merge
    compress
    save generated/sample`i'.dta, replace
}
My original file is panel data, and I split it into pieces so that it can be handled; now I need to select 100 random samples, which the code below does with a loop, but I don't think that is the efficient way to go. To better describe the problem: I have a dataset of firms with daily observations of price, return, date, dividend, and so on. The original file is so big that to load it into memory I had to split it into six pieces. Now I need to select 100 samples of 50 firms each, and I'm doing that with this loop:
***Generate 100 samples***
forvalues i = 1/100 {
    ***Select 50 companies at random***
    use generated/ids.dta, clear
    sample 50, count
    ***Merge with piece 1 of the original file***
    merge 1:m id using generated/ids10m.dta
    keep if _merge == 1 | _merge == 3
    drop _merge
    compress
    ***Keep all the ids in the "both" file***
    save generated/both`i'.dta, replace
    ***Fill sample`i' with the ids that matched a date***
    drop if date == .
    save generated/sample`i'.dta, replace
    ***Repeat for the remaining five pieces***
    foreach chunk in id20m id30m id40m id50m id60m {
        ***Open the "both" file and keep only the unmatched ids***
        use generated/both`i'.dta, clear
        keep if date == .
        keep id
        ***Save the unmatched ids to check at the end what is in there***
        save generated/rplc`i'.dta, replace
        merge 1:m id using generated/`chunk'.dta
        keep if _merge == 1 | _merge == 3
        drop _merge
        compress
        save generated/both`i'.dta, replace
        drop if date == .
        append using generated/sample`i'.dta
        save generated/sample`i'.dta, replace
    }
    erase generated/both`i'.dta
    erase generated/rplc`i'.dta
}
Now, the problem with this code is that it takes approximately 40 minutes to create the 100 samples. Is there a faster way to do the same thing?
This is an event study; size is not a problem here. The problem is not the sampling but the efficiency of the loop.
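For concreteness, here is a rough sketch of the kind of restructuring I have in mind (not tested on the real data): draw all 100 id lists up front, then read each of the six pieces only once instead of once per sample. The variable rep and the tempfile names allids and matched are hypothetical, the piece names are taken from the code above, and this assumes the matched subset (100 samples of 50 firms, with their daily rows) fits in memory; if it does not, one matched file per piece could be written out instead.

```stata
* Draw the 100 id lists once; ids.dta is small, so this part is cheap.
tempfile allids
forvalues r = 1/100 {
    use generated/ids.dta, clear
    sample 50, count
    gen int rep = `r'                  // hypothetical sample-number variable
    if `r' > 1 append using `allids'
    save `allids', replace
}

* Read each of the six pieces exactly once; joinby keeps every
* (rep, id) pair matched with that id's daily rows in the piece.
tempfile matched
local grown 0
foreach chunk in ids10m id20m id30m id40m id50m id60m {
    use `allids', clear
    joinby id using generated/`chunk'.dta, unmatched(none)
    if `grown' append using `matched'
    save `matched', replace
    local grown 1
}

* Write out one file per sample, as before.
forvalues r = 1/100 {
    use if rep == `r' using `matched', clear
    save generated/sample`r'.dta, replace
}
```

The point is that the six big files are each read 1 time instead of up to 100 times, which is where I suspect the 40 minutes are going.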
Do you want to use sample and runiform() at the same time? If yes, why? How do you want the samples to be organized? One file per sample, one big file, etc.? It's difficult to help without a clear question. – Roberto Ferrer