
I'm trying to write a numpy ndarray to a file in CSR format. This is an intermediate step: the program I want to use requires CSR format as input.

So far, I've tried using scipy.sparse.coo_matrix() and writing out to an ijv file with the following code:

import scipy.sparse

# pca_result is the dense numpy ndarray; project is the project directory path
pca_coo = scipy.sparse.coo_matrix(pca_result)
with open(project + '/matrix/for_knn.jiv', 'w') as f:
    for row, col, val in zip(pca_coo.row, pca_coo.col, pca_coo.data):
        f.write("{}\t{}\t{}\n".format(row, col, val))

The file produced by the above code causes the program downstream to segfault.

I'm assuming at this point that the problem is in the format of the output, but I haven't been able to locate the issue.
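
For context, here is a minimal sketch of how one small matrix looks as ijv (COO) triplets versus the three CSR arrays. The 2x3 matrix `dense` is a made-up example, not the real data, and whether the downstream program wants exactly this row-wise layout is an assumption:

```python
import numpy as np
import scipy.sparse

# Hypothetical toy matrix standing in for the real PCA output
dense = np.array([[0.0, 2.0, 0.0],
                  [1.0, 0.0, 3.0]])

# COO / ijv view: one (row, col, value) triplet per non-zero entry
coo = scipy.sparse.coo_matrix(dense)
triplets = list(zip(coo.row.tolist(), coo.col.tolist(), coo.data.tolist()))

# CSR view: row i owns the slice indptr[i]:indptr[i+1] of indices and data
csr = coo.tocsr()
indptr = csr.indptr.tolist()
indices = csr.indices.tolist()
data = csr.data.tolist()

print(triplets)
print(indptr, indices, data)
```

The ijv file lists every entry with explicit coordinates, while CSR groups the entries by row, so a reader expecting one of these layouts will misinterpret the other.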

Edit: Answered below.

1
I don't know of any standard way to store a CSR matrix in a text file. You have a coordinate matrix format and it sounds like the problem is in some other program. Also I don't know how you have a runtime that long for anything in sklearn.neighbors without hitting a memory wall. - CJR
What's this CSR format that your code expects? - hpaulj
I am not sure, but did you check whether the zipfile is opened correctly and the data are readable? From what I found, you should use 'from zipfile import ZipFile'. - Reihaneh Kouhi
I have lots of memory to work with. I've been surprised by the poor performance as well. I've edited to show a solution a friend proposed, which seems to work alright. (link) See the link for the Compressed Sparse Row format. For reference, the same dataset that took 30 hours to run in Python takes 3 minutes in a faster implementation (using 48 threads instead of a single Python thread, but still, a big improvement). - Brendan O'Connell
Try setting n_jobs=-1 so scikit will parallelize next time. - CJR

1 Answer


A friend helped me with the following:

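def write_csr(C, outputname):
    """
    Writes a CSR matrix out to a text file (for use with l2knng).
    C = scipy.sparse CSR matrix
    outputname = output file name

    Each matrix row becomes one line of "column value" pairs,
    with 1-based column indices.
    """
    # Note: 'a' appends, so an existing file is extended, not overwritten.
    with open(outputname, 'a') as OUTFILE:
        for i in range(C.shape[0]):
            sub = C[i,]  # row i as a 1 x n CSR matrix
            outstr = ""
            # sub.size is the number of stored (non-zero) entries in this row
            for j in range(sub.size):
                outstr += " " + str(sub.indices[j] + 1) + " " + str(sub.data[j])
            outstr += "\n"
            OUTFILE.write(outstr)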

Seems to work well.
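
Slicing out every row as its own matrix can be slow on large inputs. Below is a sketch of a variant that reads the CSR arrays (`indptr`, `indices`, `data`) directly. The names `csr_row_lines` and `write_csr_fast` are made up for illustration. It is meant to produce the same line layout as the function above, but I have not checked it against l2knng:

```python
import numpy as np
import scipy.sparse


def csr_row_lines(C):
    """Yield one text line per row: ' col val col val ...' with 1-based columns."""
    C = scipy.sparse.csr_matrix(C)  # accepts a dense array or any sparse format
    # Row i owns the slice indptr[i]:indptr[i+1] of indices and data
    for start, end in zip(C.indptr[:-1], C.indptr[1:]):
        cols = C.indices[start:end] + 1  # shift to 1-based column indices
        vals = C.data[start:end]
        yield "".join(" {} {}".format(c, v) for c, v in zip(cols, vals))


def write_csr_fast(C, outputname):
    # Assumption: 'w' (overwrite) is wanted here, unlike the 'a' mode above
    with open(outputname, "w") as out:
        for line in csr_row_lines(C):
            out.write(line + "\n")


if __name__ == "__main__":
    toy = np.array([[0.0, 2.0, 0.0],
                    [1.0, 0.0, 3.0]])
    for line in csr_row_lines(toy):
        print(repr(line))
```

A row with no stored entries comes out as an empty line, which matches what the original function writes for such rows.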