I'm using H2O with an SVMLight sparse matrix of dimensions ~700,000 x ~800,000. The file is ~800MB on disk, but importing it into H2O takes over 300GB of RAM. The import is also slow, taking ~15 minutes to finish.
By comparison, I can create and store the sparse matrix in RAM with the Matrix package fairly quickly, and there it takes only ~1.2GB of RAM.
Below is my code:
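library(h2o)
# Start a local H2O cluster on all available cores, allowing up to 512 GB of JVM heap
h2o.init(nthreads = -1, max_mem_size = "512g")
# Import and parse the SVMLight file into an H2OFrame
x <- h2o.importFile('test2.svmlight', parse = TRUE)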
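For reference, the Matrix-side comparison can be sketched roughly as follows. This is a scaled-down illustration and not my real data: the dimensions, the number of nonzeros, and the variable names (`n_rows`, `n_cols`, `nnz`, `m`) are all assumptions chosen so the example runs quickly. It only shows that the in-memory size of a `sparseMatrix` tracks the number of nonzero entries rather than rows times columns.

```r
library(Matrix)

# Illustrative sizes only (assumption); the real matrix is ~700,000 x ~800,000.
set.seed(1)
n_rows <- 7000
n_cols <- 8000
nnz    <- 50000

# Random coordinates and values for the nonzero entries
i <- sample.int(n_rows, nnz, replace = TRUE)
j <- sample.int(n_cols, nnz, replace = TRUE)
v <- runif(nnz)

# Build a compressed sparse column matrix; repeated (i, j) pairs are summed
m <- sparseMatrix(i = i, j = j, x = v, dims = c(n_rows, n_cols))

# Memory used by the sparse object versus a dense double matrix of the same shape
sparse_bytes <- as.numeric(object.size(m))
dense_bytes  <- as.numeric(n_rows) * n_cols * 8
print(format(object.size(m), units = "MB"))
```

On the real data the same kind of object is what occupies ~1.2GB, which is why the 300GB+ footprint in H2O surprised me.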
Here is my system:
openjdk version "1.8.0_121"
OpenJDK Runtime Environment (build 1.8.0_121-b13)
OpenJDK 64-Bit Server VM (build 25.121-b13, mixed mode)
Starting H2O JVM and connecting: .. Connection successful!
R is connected to the H2O cluster:
H2O cluster uptime: 2 seconds 76 milliseconds
H2O cluster version: 3.14.0.3
H2O cluster version age: 1 month and 8 days
H2O cluster name: H2O_started_from_R_ra2816_fhv677
H2O cluster total nodes: 1
H2O cluster total memory: 455.11 GB
H2O cluster total cores: 24
H2O cluster allowed cores: 24
H2O cluster healthy: TRUE
H2O Connection ip: localhost
H2O Connection port: 54321
H2O Connection proxy: NA
H2O Internal Security: FALSE
H2O API Extensions: XGBoost, Algos, AutoML, Core V3, Core V4
R Version: R version 3.4.1 (2017-06-30)
I would appreciate any advice because I really enjoy H2O and would like to use it for this project.