6
votes

How can I use the R packages zoo or xts with very large data sets (100 GB)? I know there are packages such as bigrf, ff and bigmemory that can deal with this problem, but they only support a limited set of commands: they don't have the functions of zoo or xts, and I don't know how to make zoo or xts use them. How can I do it?

I've also seen some database-related options, such as sqldf and hadoopstreaming, RHadoop, and others used by Revolution R. What do you advise? Anything else?

I just want to aggregate series, cleanse them, and perform some cointegration tests and plots. I'd rather not have to code and implement new functions for every command I need, working on small pieces of data each time.

Added: I'm on Windows

1
This is not a quantitative finance question. I'm sending this to Stack Overflow. – chrisaycock
@skan You can have a look at the mmap package, which was created by Jeff Ryan (author of xts). – CHP
Also see this post: r.789695.n4.nabble.com/… – CHP
But I'm using R for Windows, and mmap works on Linux. Do you think I cannot use packages such as ff, RevoScaleR or RHIPE with zoo, or perform cointegration or wavelet analysis? – skan
The mmap package uses mmap on unix-alikes and MapViewOfFile on Windows. You don't need to know any of that to use the package, which is why I asked if you actually looked at (i.e. tried) the package. There's a vignette with examples, and Jeff has several presentations floating around on the web. – Joshua Ulrich

1 Answer

2
votes

I have had a similar problem (albeit I was only playing with 9–10 GB). My experience is that there is no way R can handle that much data on its own, especially since your dataset appears to contain time series data.

If your dataset contains a lot of zeros, you may be able to handle it using sparse matrices; see the Matrix package ( http://cran.r-project.org/web/packages/Matrix/index.html ). This manual may also come in handy ( http://www.johnmyleswhite.com/notebook/2011/10/31/using-sparse-matrices-in-r/ ).
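As a minimal sketch of the idea (the matrix contents below are made up for illustration), converting a mostly-zero dense matrix to the Matrix package's sparse representation cuts memory use dramatically, because only the non-zero entries are stored:

```r
library(Matrix)

# A 1000 x 1000 matrix with only 500 non-zero entries (0.05% fill)
set.seed(1)
dense <- matrix(0, nrow = 1000, ncol = 1000)
dense[sample(length(dense), 500)] <- rnorm(500)

# Convert to a compressed sparse column matrix (class dgCMatrix)
sparse <- Matrix(dense, sparse = TRUE)

size_dense  <- as.numeric(object.size(dense))   # ~8 MB of doubles
size_sparse <- as.numeric(object.size(sparse))  # a few KB
```

The same arithmetic (`%*%`, `colSums`, subsetting, etc.) works on the sparse object, so downstream code often needs no changes.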

I used PostgreSQL; the relevant R package is RPostgreSQL ( http://cran.r-project.org/web/packages/RPostgreSQL/index.html ). It lets you query your PostgreSQL database using SQL syntax, and the data is downloaded into R as a data frame. It may be slow (depending on the complexity of your query), but it is robust and handy for data aggregation.
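A sketch of that workflow, assuming the RPostgreSQL package and a running PostgreSQL server; the database name, credentials, and the `prices` table are all hypothetical placeholders:

```r
# Push the aggregation into PostgreSQL so only the small result
# ever enters R, instead of the full 100 GB table.
daily_averages <- function(dbname, host = "localhost", user, password) {
  con <- DBI::dbConnect(RPostgreSQL::PostgreSQL(),
                        dbname = dbname, host = host,
                        user = user, password = password)
  on.exit(DBI::dbDisconnect(con))  # always close the connection
  DBI::dbGetQuery(con, "
      SELECT date_trunc('day', ts)::date AS day,
             avg(price)                  AS avg_price
        FROM prices
    GROUP BY 1
    ORDER BY 1")                   # returns a data.frame
}
```

The key design choice is to let the database do the grouping; `dbGetQuery()` then hands back a data frame small enough to convert to zoo/xts with `zoo::zoo(df$avg_price, df$day)`.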

Drawback: you need to upload the data into the database first, so your raw data must be clean and saved in a readable format (txt/csv). This is likely to be the biggest issue if your data is not already in a sensible shape. Uploading "well-behaved" data into the DB, however, is easy (see http://www.postgresql.org/docs/8.2/static/sql-copy.html and How to import CSV file data into a PostgreSQL table?).
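As an illustration of that upload step, with a hypothetical `prices` table and file path (COPY reads the file on the server; use psql's `\copy` variant to read it from the client machine):

```sql
-- Hypothetical schema matching a cleaned CSV of tick data
CREATE TABLE prices (
    ts     timestamp,
    symbol text,
    price  numeric
);

-- Bulk-load the CSV; FORMAT csv + HEADER skips the header row
COPY prices FROM '/data/ticks.csv' WITH (FORMAT csv, HEADER);
```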

I would recommend PostgreSQL or any other relational database for your task. I did not try Hadoop, but using CouchDB nearly drove me round the bend. Stick with good old SQL.