I have a 210,000 x 500 sparse matrix in R that I'm trying to cluster using h2o. I assumed a 210,000-row matrix would not be large for h2o, but importing it into the h2o instance takes a very long time: I let it run for over 10 minutes and stopped it before completion. When I subset the first 10,000 rows of the sparse matrix and import those, it takes only a few seconds. I've also tried importing incrementally, and that takes a long time too (I stopped at 60,000 rows). Is this normal, or am I doing something wrong?

Here's what I'm using:

library(h2o)
localH2O <- h2o.init(nthreads = -1, max_mem_size = "16g")     
spmx.h2o <- as.h2o(sparse_mx)

Below is more info about the h2o instance when it's generated:

java version "1.8.0_131"
Java(TM) SE Runtime Environment (build 1.8.0_131-b11)
Java HotSpot(TM) 64-Bit Server VM (build 25.131-b11, mixed mode)

Starting H2O JVM and connecting: . Connection successful!

    R is connected to the H2O cluster: 
        H2O cluster uptime:         6 seconds 779 milliseconds 
        H2O cluster version:        3.10.4.6 
        H2O cluster version age:    1 month and 30 days  
        H2O cluster name:           H2O_started_from_R_M_vto433 
        H2O cluster total nodes:    1 
        H2O cluster total memory:   14.22 GB 
        H2O cluster total cores:    4 
        H2O cluster allowed cores:  4 
        H2O cluster healthy:        TRUE 
        H2O Connection ip:          localhost 
        H2O Connection port:        54321 
        H2O Connection proxy:       NA 
        H2O Internal Security:      FALSE 
        R Version:                  R version 3.4.0 (2017-04-21) 

I'm trying to avoid writing the matrix to a file and importing it again, simply because I think 210,000 rows and 500 columns should not be a problem for h2o to handle.

Found several SO answers that appeared to address the speed of data transfer. Here's one: stackoverflow.com/questions/41477700/… I would search for others and then say which ones you attempted and why they didn't solve your issue. - IRTFM
I have opened a JIRA ticket and we are looking into the issue: 0xdata.atlassian.net/browse/PUBDEV-4630 - Erin LeDell

1 Answer


It seems that importing larger sparse matrices into an h2o instance directly from R is not really practical at the moment. Importing through an SVMLight file is much faster, as discussed here:

How to get sparse matrices into H2O?
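
For readers unfamiliar with SVMLight, here is a minimal sketch of what the format looks like, using only the Matrix package. The helper `to_svmlight_lines` is hypothetical (it is not part of h2o or sparsio) and is meant for illustration on small matrices, not as a fast converter. It assumes a dummy label of 0 for every row and one-based column indices.

```r
library(Matrix)

# Hypothetical helper (not part of h2o or sparsio): turn a sparse matrix into
# SVMLight text lines of the form "<label> <col>:<value> <col>:<value> ...".
# Assumption: every row gets a dummy label of 0 unless labels are supplied.
to_svmlight_lines <- function(x, labels = rep(0, nrow(x))) {
  # summary() on a CsparseMatrix gives a data frame of (i, j, x) triplets, 1-based
  trip <- summary(as(x, "CsparseMatrix"))
  # column indices should be increasing within each row
  trip <- trip[order(trip$i, trip$j), ]
  # group the "col:value" tokens by row, keeping rows with no non-zero entries
  rows <- split(paste0(trip$j, ":", trip$x),
                factor(trip$i, levels = seq_len(nrow(x))))
  feats <- vapply(rows, paste, character(1), collapse = " ")
  # empty rows leave a trailing space after the label, which trimws() removes
  trimws(paste(labels, feats))
}

# Small 3 x 4 example: only non-zero entries appear in the output
m <- sparseMatrix(i = c(1, 1, 3), j = c(2, 4, 1), x = c(5, 1.5, 2), dims = c(3, 4))
writeLines(to_svmlight_lines(m))
```

Because only the non-zero entries are written, the file stays small for a sparse matrix. For a matrix of the size in the question, a compiled writer is likely to be much faster than this pure-R sketch.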


Edit: While looking for a fast, efficient way to convert the sparse matrix to SVMLight format, I tried the Laurae2/sparsity package, but I couldn't get it installed because of a C++ compilation error. Following @Dmitriy Selivanov's suggestion, I used the sparsio package instead; it converted the sparse matrix to SVMLight format easily, and the file imported quickly into h2o.

## The following works
library(sparsio)
library(h2o)
write_svmlight(x = spmx, file = "spmx_svmlight.txt", zero_based = FALSE) #h2o accepts one_based by default

localH2O <- h2o.init(nthreads = -1, max_mem_size = "16g")  
spmx.h2o <- h2o.importFile("spmx_svmlight.txt", parse = TRUE) 

My dataset is still fairly small, and I'm not sure how well write_svmlight will scale to much larger datasets. It took about 40 seconds on my data, which is OK.