3 votes

I have a large sparse matrix from scipy (300k x 100k, all binary values, mostly zeros). I would like to make the rows of this matrix into an RDD and then do some computations on those rows: evaluate a function on each row, evaluate functions on pairs of rows, and so on.

The key thing is that it's quite sparse and I don't want to explode the cluster. Can I convert the rows to SparseVectors? Or perhaps convert the whole thing to a SparseMatrix?

Can you give an example where you read in a sparse array, set up its rows as an RDD, and compute something from the cartesian product of those rows?

Try using pyspark. – Eli Sadoff
@EliSadoff I am using pyspark; the issue is I don't know which objects to use or how to set them up. – cgreen
Ah, I didn't realize that. I thought you were trying to figure out how to get it from Python to Scala. – Eli Sadoff

2 Answers

4 votes

I had this issue recently. I think you can convert directly by constructing the SparseMatrix from the scipy csc_matrix attributes (borrowing the example setup from Yang Bryan).

import numpy as np
import scipy.sparse as sps
from pyspark.mllib.linalg import Matrices

# create a sparse matrix
row = np.array([0, 2, 2, 0, 1, 2])
col = np.array([0, 0, 1, 2, 2, 2])
data = np.array([1, 2, 3, 4, 5, 6]) 
sv = sps.csc_matrix((data, (row, col)), shape=(3, 3))

# convert to pyspark SparseMatrix
sparse_matrix = Matrices.sparse(sv.shape[0], sv.shape[1], sv.indptr, sv.indices, sv.data)
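
To get at the RDD and cartesian-product part of the question, the rows can also be pulled out of the scipy matrix as pyspark SparseVectors and parallelized, which keeps everything sparse. This is only a minimal sketch, assuming sc is an active SparkContext and sv is the csc_matrix from above; the pairwise dot product is just an example function.

from pyspark.mllib.linalg import Vectors

# slice each row out of CSR storage and wrap it as SparseVector(size, indices, values)
csr = sv.tocsr()
rows = [Vectors.sparse(csr.shape[1],
                       csr.indices[csr.indptr[i]:csr.indptr[i + 1]],
                       csr.data[csr.indptr[i]:csr.indptr[i + 1]])
        for i in range(csr.shape[0])]

row_rdd = sc.parallelize(rows)

# example: evaluate a function on every pair of rows via the cartesian product
pair_dots = row_rdd.cartesian(row_rdd).map(lambda pair: pair[0].dot(pair[1]))
pair_dots.collect()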
2 votes

The only thing you have to do is call toarray().

import numpy as np
import scipy.sparse as sps

# create a sparse matrix
row = np.array([0, 2, 2, 0, 1, 2])
col = np.array([0, 0, 1, 2, 2, 2])
data = np.array([1, 2, 3, 4, 5, 6]) 
sv = sps.csc_matrix((data, (row, col)), shape=(3, 3))
sv.toarray()
> array([[1, 0, 4],
>        [0, 0, 5],
>        [2, 3, 6]])

type(sv)
> <class 'scipy.sparse.csc.csc_matrix'>

# read the rows of sv into an RDD
sv_rdd = sc.parallelize(sv.toarray())  # toarray() converts the sparse matrix to dense rows
sv_rdd.collect()
> [array([1, 0, 4]), array([0, 0, 5]), array([2, 3, 6])]

type(sv_rdd)
> <class 'pyspark.rdd.RDD'>
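
One caveat: toarray() densifies the matrix, so for the 300k x 100k case in the question this would materialize all of the zeros; the SparseVector route in the other answer avoids that. With the dense-row RDD above, the per-row and pairwise computations the question asks for could look something like the following sketch, where sum and np.dot are just placeholders for whatever functions you actually need.

# evaluate a function on each row
row_sums = sv_rdd.map(lambda r: r.sum())

# evaluate a function on every pair of rows via the cartesian product
pair_dots = sv_rdd.cartesian(sv_rdd).map(lambda pair: float(np.dot(pair[0], pair[1])))
pair_dots.collect()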