2 votes

I'm currently working with a 10% sample of a very large dataset (10 variables, over 300 million rows), which amounts to over 200 GB of data when the full dataset is stored in .dta format. Stata handles operations like egen, collapse, merging, etc. in a reasonable amount of time for the 10% sample when using Stata/MP on a UNIX server with ~50 GB of RAM and multiple cores.

However, now I want to move on to analyzing the whole dataset. Even if I use a machine that has enough RAM to hold it, simply generating a variable takes ages. (I suspect the background operations are causing Stata to run into virtual memory.)

The problem is also very amenable to parallelization: the rows in the dataset are independent of each other, so I can just as easily think of the one large dataset as 100 smaller datasets.

Does anybody have suggestions for how to process/analyze this data, or feedback on the ideas I currently have? I mostly use Stata/SAS/MATLAB, so perhaps there are other approaches that I am simply unaware of.

Here are some of my current ideas:

  1. Split the dataset up into smaller datasets and use informal parallel processing in Stata. I can run my cleaning/processing/analysis on each partition and then merge the results afterwards without having to store all the intermediate parts (see the Stata sketch after this list).

  2. Use SQL to store the data and also perform some of the data manipulation, such as aggregating over certain values. One concern here is that some tasks that Stata handles fairly easily, such as comparing values across time, won't work so well in SQL. Also, I'm already running into performance issues when running some queries in SQL on a 30% sample of the data, though perhaps I'm not optimizing correctly with indexes, etc. Shard-Query also seems like it could help with this, but I have not researched it thoroughly yet.

  3. R also looks promising, but I'm not sure if it would solve the problem of working with this enormous amount of data.
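
For idea 1, a minimal sketch of the partition-process-append pattern in Stata might look like the following; the file names, the rowid variable used to assign partitions, and the collapse step are hypothetical placeholders for the actual cleaning/processing:

    * Sketch of informal partitioning in Stata. File names, the rowid
    * variable, and the collapse step are hypothetical placeholders.
    local npart 100

    forvalues i = 1/`npart' {
        * Read only the rows belonging to partition i (membership is
        * assumed to be defined by mod(rowid, npart) == i - 1).
        use if mod(rowid, `npart') == `i' - 1 using "full_data.dta", clear

        * ... cleaning / processing / analysis for this partition ...
        collapse (sum) y, by(groupvar)

        * Save the (much smaller) partition-level result.
        save "results_`i'.dta", replace
    }

    * Stack the partition-level results back together.
    clear
    forvalues i = 1/`npart' {
        append using "results_`i'.dta"
    }
    save "results_all.dta", replace

Each partition pass could also be launched as a separate Stata batch job so the partitions run in parallel rather than sequentially.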

JSS has a timely article covering such issues from R's perspective. It might be of some help; however, the size of the data they use in the examples is 12 GB, I think. – radek

2 Answers

2 votes

Thanks to those who have commented and replied. I realized that my problem is similar to this thread. I have rewritten some of my Stata data-manipulation code in SQL, and the response time is much quicker. I believe I can make large optimization gains by utilizing indexes correctly and, if necessary, using parallel processing via partitions/shards. After all the data manipulation has been done, I can import the data into Stata via ODBC.
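
For reference, a minimal sketch of that last import step with odbc load; the DSN name, table, and column names in the query are hypothetical placeholders, not the actual schema:

    * Pull pre-aggregated results from the database into Stata via ODBC.
    * The DSN ("mydsn"), table, and column names are hypothetical placeholders.
    * (Indexing the GROUP BY columns in the database beforehand, e.g.
    * CREATE INDEX idx ON big_table (id, year), is the kind of optimization
    * referred to above.)
    odbc load, ///
        exec("SELECT id, year, SUM(value) AS total_value FROM big_table GROUP BY id, year") ///
        dsn("mydsn") clear
    save "aggregated_results.dta", replace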

1 vote

Since you are familiar with Stata, there is a well-documented FAQ about large datasets in Stata, Dealing with Large Datasets, which you might find helpful.

I would clean column by column: split the variables up, run any specific cleaning routines, and merge them back in later.

Depending on your machine's resources, you should be able to hold the individual columns in multiple temporary files using tempfile. Taking care to select only the variables or columns most relevant to your analysis should also reduce the size of your dataset quite a lot.
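
A minimal sketch of this column-wise workflow, assuming an id variable that uniquely identifies rows; the variable names (x1, x2) and the file names are hypothetical placeholders:

    * Load only the identifier plus the variables needed for one step;
    * id, x1, x2, and full_data.dta are hypothetical placeholders.
    use id x1 using "full_data.dta", clear
    * ... run the cleaning routine for x1 ...
    tempfile part_x1
    save "`part_x1'"

    use id x2 using "full_data.dta", clear
    * ... run the cleaning routine for x2 ...

    * Merge the cleaned columns back together on the identifier.
    merge 1:1 id using "`part_x1'", nogenerate
    save "cleaned_subset.dta", replace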