
I have a large dataset, and since it is large I have to either split it or load one variable at a time. I have loaded the unique identifier id, and I need to select 50 observations at random, 100 times. I found sample and runiform() for generating a random sample, but my problem is that I need 100 samples of 50 observations each. Since I am sampling from the entire dataset, which is large, I can keep only one variable in memory, so I would need to save the result of the sampling 100 times. I know I could use a forvalues loop, but it is not efficient and takes a long time even for 10 iterations. Is there a faster way to generate more than one sample? Here's my code:

scalar i=0
forvalues i = 1(1)100 {
    clear all
    use generated/data1.dta
    sample 50, count
    save generated/sample`i'.dta, replace
    merge 1:m id using generated/10m.dta
    keep if _merge==3 | _merge==1
    drop _merge
    compress
    save generated/sample`i'.dta, replace
}

My original file is panel data, and I split it into pieces so that it can be handled; now I need to select 100 random samples, which I do with the loop below, but I don't think that is the efficient way to go. To better describe the problem: I have a dataset of firms with daily observations of price, return, date, dividend, and so on. The original file is so big that to load it into memory I had to split it into 6 pieces. Now I need to select 100 samples of 50 firms each, and I'm doing that with this loop:

***Generate 100 samples***
scalar i=0
forvalues i = 1(1)100 {
    clear all
    ***Select 50 companies at random***
    use generated/ids.dta
    sample 50, count
    ***Merge with part 1 of the original file***
    merge 1:m permno using generated/ids10m.dta
    keep if _merge==1 | _merge==3
    drop _merge
    compress
    ***Keep all the ids in the "both" file***
    save generated/both`i'.dta, replace
    drop if date==.
    ***Fill sample`i' with the ids that have a matching date***
    save generated/sample`i'.dta, replace

    clear all
    ***Open the "both" file and keep only the non-matched ids***
    use generated/both`i'.dta, clear
    keep if date==.
    keep id
    ***Keep the non-matched ids to check at the end what's in there***
    save generated/rplc`i'.dta, replace
    merge 1:m id using generated/id20m.dta
    keep if _merge==1 | _merge==3
    drop _merge
    compress
    save generated/both`i'.dta, replace
    drop if date==.
    append using generated/sample`i'.dta
    save generated/sample`i'.dta, replace

    clear all
    use generated/both`i'.dta, clear
    keep if date==.
    keep id
    save generated/rplc`i'.dta, replace
    merge 1:m id using generated/id30m.dta
    keep if _merge==1 | _merge==3
    drop _merge
    compress
    save generated/both`i'.dta, replace
    drop if date==.
    append using generated/sample`i'.dta
    save generated/sample`i'.dta, replace

    use generated/both`i'.dta, clear
    keep if date==.
    keep id
    save generated/rplc`i'.dta, replace
    merge 1:m id using generated/id40m.dta
    keep if _merge==1 | _merge==3
    drop _merge
    compress
    save generated/both`i'.dta, replace
    drop if date==.
    append using generated/sample`i'.dta
    save generated/sample`i'.dta, replace

    use generated/both`i'.dta, clear
    keep if date==.
    keep id
    save generated/rplc`i'.dta, replace
    merge 1:m id using generated/id50m.dta
    keep if _merge==1 | _merge==3
    drop _merge
    compress
    save generated/both`i'.dta, replace
    drop if date==.
    append using generated/sample`i'.dta
    save generated/sample`i'.dta, replace

    use generated/both`i'.dta, clear
    keep if date==.
    keep id
    save generated/rplc`i'.dta, replace
    merge 1:m id using generated/id60m.dta
    keep if _merge==1 | _merge==3
    drop _merge
    compress
    save generated/both`i'.dta, replace
    drop if date==.
    append using generated/sample`i'.dta
    save generated/sample`i'.dta, replace
    erase generated/both`i'.dta
    erase generated/rplc`i'.dta
}

Now, the problem with this code is that it takes approximately 40 minutes to create the 100 samples. Is there a faster way to do the same thing?

This is an event study; size is not a problem here. The problem is not the sampling itself but the efficiency of the loop.

Can you show us some code? Does it work? Is it only an efficiency issue? What exactly do you mean by a "large" dataset? Please see the Asking section of the help center and stackoverflow.com/help/mcve. – Roberto Ferrer
How many variables do you actually need from the original data set? Are you using sample and runiform() at the same time? If yes, why? How do you want the samples to be organized? One file per sample, one big file, etc.? It's difficult to help without a clear question. – Roberto Ferrer
What structure does your original data have? Is it panel data, or cross-sectional? What structure are you aiming for? Please, if you decide to answer any of these, do so by editing your original post, not through the comments section. – Roberto Ferrer
Thanks for the edit. However, many things still remain unclear. – Roberto Ferrer
Among them: 1) Exactly how many observations are there in the data set? 2) How many companies? 3) What's the purpose of the study? 4) Why do you need 100 samples? How did you choose n = 50 for each? 5) In many studies of establishments, size (however defined) is often an important classification variable; in such cases stratification by size or sampling with probability proportional to size can be advantageous. Is that the case here? – Steve Samuels

1 Answer


One place for improvement stands out in your do-file: you are doing 600 merges of sampled ids with the "big" files. Here is code that requires only one merge per big file, or six in your case. The trick is to reshape the sample data from long to wide, with one line per id and indicators for the sample(s) that selected that id. After merging this file with each big data set, re-assemble the samples. The toy example below has two big files and three samples.
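For the sampling step itself, which the toy example takes as given, a minimal sketch of how the 100 samples of 50 ids could be drawn and stacked in long form (assuming, as in your question, that generated/ids.dta holds one observation per id) would be:

clear
tempfile allids
save `allids', emptyok
forvalues i = 1/100 {
    use generated/ids.dta, clear   // id list, file name taken from your question
    sample 50, count               // draw 50 ids without replacement
    gen sample = `i'               // tag the draw
    append using `allids'
    save `allids', replace
}
* `allids' now has one observation per selected id per sample,
* which is the long shape that Step 1 below starts from.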

clear
/* Generate 1st BIG data set */
input id  t
10  1
10  2
40  1
40  2
40  3
50  1
50  2
55  2
55  6
90  2
90  3

end
save big1, replace

/* Generate 2nd BIG data set */
clear
input id  t
90   4
90   5
100  1
100  2
100  3
140  1
140  2
143  2
155  1
155  2
180  2
180  3
end
save big2, replace

/* Generate three sample data sets; you'll do this with "sample" */
clear
input sample id
1  40
1  180
end
tempfile samp1
save `samp1'

clear
input sample id
2  10
2   90
end
tempfile samp2
save `samp2'
clear
input  sample id
3  100
3  155
end

/* Step 1. Append all samples (the third sample is still in memory) */
append using `samp1' `samp2'
order id sample
sort id sample
list


/* Step 2. Reshape Wide with one obs per id */
tempvar set
gen `set' = sample
reshape wide sample, i(id) j(`set')
tempfile t2
save `t2'
list

/* Step 3. Merge sample ids with each "big" data set
   and append the results  */

clear
tempfile t3
save `t3', emptyok replace

forvalues g = 1/2 {
    use big`g', clear
    merge m:1 id using `t2'
    keep if _merge==3
    drop _merge
    append using `t3'
    save `t3', replace
}

sort id t
list, sepby(id)

/* Step 4: Reassemble samples with data into
  one data set, saved in advance */
clear
/*temporary file to hold results */
tempfile allsamps
save `allsamps', emptyok

/* Cycle through samples: change 3 to the number of samples */
forvalues i = 1/3 {
    use `t3', clear
    gen sample = `i' if sample`i'==`i'
    drop if sample==.
    append using `allsamps'
    save `allsamps', replace
}
drop sample?

order sample id
sort sample id t

save allsamples, replace
list, sepby(sample)

with result:

     +------------------+
     | sample    id   t |
     |------------------|
  1. |      1    40   1 |
  2. |      1    40   2 |
  3. |      1    40   3 |
  4. |      1   180   2 |
  5. |      1   180   3 |
     |------------------|
  6. |      2    10   1 |
  7. |      2    10   2 |
  8. |      2    90   2 |
  9. |      2    90   3 |
 10. |      2    90   4 |
 11. |      2    90   5 |
     |------------------|
 12. |      3   100   1 |
 13. |      3   100   2 |
 14. |      3   100   3 |
 15. |      3   155   1 |
 16. |      3   155   2 |
     +------------------+
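One small point when scaling the toy example to your 100 samples: the wildcard sample? in the final drop matches only sample1 through sample9, so Step 4 would become something like

forvalues i = 1/100 {
    use `t3', clear
    gen sample = `i' if sample`i'==`i'
    drop if sample==.
    append using `allsamps'
    save `allsamps', replace
}
drop sample1-sample100      // sample? would match only sample1-sample9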

Some other observations:

  1. You could check which parts of the do-file take longest by setting timers in Stata; see the help for timer (a minimal sketch follows after this list).

  2. sample requires sorting the data. If the id file is very large, consider a method of drawing the sample that does not require sorting. Fan et al. (1962) describe sampling plans with this feature:

a. Sequential sampling (Fan Method 1). See Chromy (1979, p. 401).

b. Systematic sampling.

In place of one random sample of n = 50, take 10 systematic sub-samples of size 5 for each of your 100 samples. Each sub-sample is treated as a cluster, which provides valid standard errors. A further advantage accrues if you can sort the list of company ids informatively (by size or sector, for example): the systematic sub-samples will then spread over the entire list. See Deming (1960) for many examples; a rough sketch follows below.
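On point (b), a rough sketch of drawing one systematic sub-sample of ids (the file name and sort key are placeholders; the fractional interval keeps the sub-sample size exact even when it does not divide the number of ids evenly):

use generated/ids.dta, clear          // placeholder id list
sort id                               // or an informative sort: size, sector, ...
local n = 5                           // sub-sample size
local k = _N/`n'                      // sampling interval, possibly fractional
local r = runiform()*`k'              // random start in (0, k)
keep if floor((_n-`r')/`k') > floor((_n-1-`r')/`k')

Ten such sub-samples, each with its own random start, would make up one sample of 50; Deming (1960) covers the details of such replicated designs.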
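And on point (1), a minimal sketch of the timer pattern:

timer clear 1
timer on 1
* ... the block of the do-file you want to time ...
timer off 1
timer list 1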

References

Chromy, J. R. (1979). Sequential sample selection methods. Proceedings of the Survey Research Methods Section of the American Statistical Association, 401-406. Available at http://www.amstat.org/sections/srms/Proceedings/papers/1979_081.pdf

Deming, W. E. (1960). Sample Design in Business Research. New York: Wiley.

Fan, C. T., Muller, M. E., and Rezucha, I. (1962). Development of sampling plans by using sequential (item by item) selection techniques and digital computers. Journal of the American Statistical Association, 57, 387-402.