5
votes

I have a very large matrix (about 500000 * 20000) containing the data I would like to analyze with PCA. To do this I'm using the Parallel Colt library, trying both singular value decomposition and eigenvalue decomposition to get the eigenvectors and eigenvalues of the covariance matrix. But these methods exhaust the heap and I get "OutOfMemory" errors...

Even using SparseDoubleMatrix2D (the data are very sparse) the errors remain, so I ask you: how can I solve this problem?

Change libraries?

Is Java the only language considered? I can imagine this matrix is insanely big... – Roman Byshko
I don't see how switching to another language would change anything. – duffymo

3 Answers

2
votes

You can compute PCA with Oja's rule: it's an iterative algorithm that improves an estimate of the principal component one vector at a time. It's slower than the usual PCA, but it requires you to store only one vector in memory. It's also very numerically stable. A rough sketch of the idea is shown below the link.

http://en.wikipedia.org/wiki/Oja%27s_rule
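
For a concrete picture, here is a rough, self-contained sketch of the idea (my own illustrative code, not Parallel Colt's API and not a tuned implementation): it streams one zero-mean row at a time and never builds the covariance matrix. The class name, toy data generator and step-size schedule are all assumptions.

    import java.util.Random;

    // Illustrative sketch of Oja's rule for the first principal component.
    // It streams one (zero-mean) row at a time, so memory usage is just the
    // current row plus the estimate itself -- no covariance matrix is built.
    public class OjaPcaSketch {

        // One Oja update: y = w.x ; w <- w + eta * y * (x - y * w)
        static void ojaUpdate(double[] w, double[] x, double eta) {
            double y = 0.0;
            for (int i = 0; i < w.length; i++) {
                y += w[i] * x[i];
            }
            for (int i = 0; i < w.length; i++) {
                w[i] += eta * y * (x[i] - y * w[i]);
            }
        }

        static void normalize(double[] v) {
            double norm = 0.0;
            for (double d : v) norm += d * d;
            norm = Math.sqrt(norm);
            for (int i = 0; i < v.length; i++) v[i] /= norm;
        }

        public static void main(String[] args) {
            int dim = 50;                       // tiny toy dimension for the demo
            Random rnd = new Random(42);

            // Start from a random unit vector.
            double[] w = new double[dim];
            for (int i = 0; i < dim; i++) w[i] = rnd.nextGaussian();
            normalize(w);

            // Stream toy rows whose dominant variance lies along coordinate 0,
            // so the true first principal component is (1, 0, 0, ...).
            for (int t = 1; t <= 100000; t++) {
                double[] x = new double[dim];
                x[0] = rnd.nextGaussian() * 10.0;            // high-variance direction
                for (int i = 1; i < dim; i++) x[i] = rnd.nextGaussian();
                ojaUpdate(w, x, 1.0 / (1000.0 + t));         // decaying step size
            }

            normalize(w);
            System.out.printf("weight on the dominant direction: %.3f%n", Math.abs(w[0]));
        }
    }

To get more than one component you would deflate the data (subtract the part explained by the component you just recovered) or use a multi-component generalization such as Sanger's rule.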

0
votes

I'm not sure that changing libraries will help. You're going to need doubles (8 bytes each), and the covariance matrix in this case would be 20000 * 20000 no matter which library builds it, so switching libraries won't change the underlying calculations much.
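
A quick back-of-the-envelope calculation (my own figures, assuming plain dense double storage) shows why the heap fills up:

    public class HeapEstimate {
        public static void main(String[] args) {
            // 8 bytes per double, dense storage assumed.
            long covBytes  = 20000L * 20000L * 8L;    // 20000 x 20000 covariance matrix
            long dataBytes = 500000L * 20000L * 8L;   // 500000 x 20000 data matrix
            System.out.printf("covariance matrix: ~%.1f GB%n", covBytes / 1e9);   // ~3.2 GB
            System.out.printf("dense data matrix: ~%.1f GB%n", dataBytes / 1e9);  // ~80.0 GB
        }
    }

Even if the raw data is stored sparsely, the covariance matrix and the decomposition workspaces are typically dense.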

What is the -Xmx setting when you run? What about the perm gen size? Perhaps you can increase them.
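
For example, something along these lines (the heap size, perm gen size, classpath and main class are placeholders, not values from the question; a heap much above 2GB needs the 64-bit JVM mentioned further down):

    java -Xmx4g -XX:MaxPermSize=256m -cp <your-classpath> MyPcaMain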

Does the algorithm halt immediately, or does it run for a while? If it's the latter, you can attach to the process using VisualVM 1.3.3 (download and install all the plugins). It'll let you see what's happening on the heap, threads, etc., and could help you ferret out the root cause.

A Google search for "Java eigenvalue of large matrices" turned up this library from Google. If you scroll down in the comments, I wonder if a block Lanczos eigenvalue analysis might help. It might be enough if you can get a subset of the eigenvalues.

These SVM implementations claim to be useful for large datasets:

http://www.support-vector-machines.org/SVM_soft.html

I don't think you can ask for more than 2GB of heap from a 32-bit JVM:

http://www.theserverside.com/discussions/thread.tss?thread_id=26347

According to Oracle, you'll need a 64-bit JVM running on a 64-bit OS:

http://www.oracle.com/technetwork/java/hotspotfaq-138619.html#gc_heap_32bit

0
votes

I built some sparse, incremental algorithms for just this sort of problem. Conveniently, they're built on top of Colt.

See the HallMarshalMartin class in the trickl-cluster library below. You can feed it chunks of rows at a time, so it should solve your memory issues.

The code is available under the GPL. I'm afraid I've only just released it, so it's short on documentation, but hopefully it's fairly self-explanatory. There are JUnit tests that should help with usage.

http://open.trickl.com/trickl-pca/index.html