
I'm using H2O with an SVMLight sparse matrix of dimensions ~700,000 x ~800,000. The file is approximately 800 MB on disk, but importing it into H2O consumes over 300 GB of RAM, and the import takes a long time (~15 minutes) to finish.

By comparison, I can create and store the same sparse matrix in RAM using the Matrix package quite quickly, and in that case the sparse matrix takes only ~1.2 GB of RAM.
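(For context, here is a minimal sketch of why the Matrix package stays small: a compressed sparse matrix only pays for its non-zero entries plus per-column pointers, not for every cell. The dimensions below mirror the question; the two non-zero entries are placeholders.)

```r
library(Matrix)

# A dgCMatrix stores only the non-zero values (x), their row indices (i),
# and one pointer per column -- memory scales with nnz + ncol, not nrow * ncol.
m <- sparseMatrix(i = c(1, 5e5), j = c(1, 7e5), x = c(1, 2),
                  dims = c(7e5, 8e5))
object.size(m)  # a few MB, despite the matrix having 5.6e11 cells
```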

Below is my code:

library(h2o)
h2o.init(nthreads=-1,max_mem_size = "512g")

x <- h2o.importFile('test2.svmlight', parse = TRUE)

Here is my system:

openjdk version "1.8.0_121"
OpenJDK Runtime Environment (build 1.8.0_121-b13)
OpenJDK 64-Bit Server VM (build 25.121-b13, mixed mode)

Starting H2O JVM and connecting: .. Connection successful!

R is connected to the H2O cluster: 
H2O cluster uptime:         2 seconds 76 milliseconds 
H2O cluster version:        3.14.0.3 
H2O cluster version age:    1 month and 8 days  
H2O cluster name:           H2O_started_from_R_ra2816_fhv677 
H2O cluster total nodes:    1 
H2O cluster total memory:   455.11 GB 
H2O cluster total cores:    24 
H2O cluster allowed cores:  24 
H2O cluster healthy:        TRUE 
H2O Connection ip:          localhost 
H2O Connection port:        54321 
H2O Connection proxy:       NA 
H2O Internal Security:      FALSE 
H2O API Extensions:         XGBoost, Algos, AutoML, Core V3, Core V4 
R Version:                  R version 3.4.1 (2017-06-30) 

I would appreciate any advice because I really enjoy H2O and would like to use it for this project.


1 Answer


H2O stores data in a columnar compressed store, and is optimized to work well with datasets that have a huge number (billions+) of rows and a large number (thousands+) of columns.

Each column is stored as a collection of what H2O calls chunks. A chunk is a group of contiguous rows. A chunk may be sparse, so if a chunk contains 10,000 rows that are all missing, the amount of memory needed by that chunk can be very small. But the chunk object itself still has to exist.

In practice, this means that H2O stores rows sparsely but does not store columns sparsely. So for very wide data, it won't store things as efficiently as a pure sparse-matrix package.
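A back-of-the-envelope sketch shows why wide data hurts. All the constants below are rough assumptions for illustration (the assumed chunk size and per-chunk fixed cost are not H2O internals), but the shape of the arithmetic is the point: the overhead is multiplied by the number of columns, even when the chunks hold almost no data.

```r
# Rough lower-bound on per-chunk bookkeeping overhead (all constants assumed).
n_cols         <- 8e5   # columns in the SVMLight file
n_rows         <- 7e5   # rows
rows_per_chunk <- 1e4   # assumed rows per chunk
bytes_per_chunk <- 1e3  # assumed fixed cost of one chunk object, even if empty

chunks_per_col    <- ceiling(n_rows / rows_per_chunk)        # 70 chunks/column
total_overhead_gb <- n_cols * chunks_per_col * bytes_per_chunk / 1e9
total_overhead_gb  # ~56 GB of bookkeeping before storing a single value
```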

In your specific case, 800,000 columns is pushing H2O's limits.

One thing some people don't know about H2O is that it handles categorical columns efficiently. So if your column explosion comes from manually one-hot encoding your data, you don't need to do that with H2O: keeping the original categorical column is a much more efficient representation.
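To illustrate the difference in plain R (this is a hypothetical toy example, not the asker's data): a single factor column replaces one dummy column per level, so a high-cardinality feature that would explode into hundreds of thousands of one-hot columns stays as one column if left categorical.

```r
# One categorical column with 3 levels...
cities <- factor(c("NYC", "SF", "NYC", "LA"))

# ...manually one-hot encoded becomes nlevels(cities) columns:
onehot <- model.matrix(~ cities - 1)
dim(onehot)  # 4 rows x 3 columns -- and this grows with cardinality
```

In H2O, the equivalent approach is to keep such a column as a single enum/factor column on the H2OFrame (e.g. by converting it with `as.factor()` after import) and let H2O deal with the encoding internally, rather than widening the data yourself before import.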