Is there a more efficient way to load file content into a sparse matrix? The following code reads from a big file (8 GB) whose values are mostly zero (very sparse) and does some processing on each line read. I would like to perform arithmetic operations on the data efficiently, so I try to store the lines as a sparse matrix. Since the number of lines in the file is not known in advance, and arrays/matrices are not dynamic, I first have to store the rows in a list and then transform it to a csr_matrix. This phase ("X = csr_matrix(X)") takes a lot of time and memory.
Any suggestions?

import numpy as np
from scipy.sparse import csr_matrix
from datetime import datetime as time

header_names = []

def readOppFromFile(filepath):

    print("Read Opportunities From File..." + str(time.now()))

    # read file header - feature names separated with commas
    global header_names

    with open(filepath, "r") as f:

        i = 0

        header_names = f.readline().rstrip().split(',')

        for line in f:

            # replace empty strings with 0 in the comma-separated line; also clean null values (replace with 0)
            yield [(x.replace('null', '0') if x else 0) for x in line.rstrip().split(',')]
            i += 1

        print("Number of opportunities read from file: %s" % str(i))

def processOpportunities(opp_data):

    print("Process Opportunities ..." + str(time.now()))

    # Initialization
    X = []
    targets_array = []

    for opportunity in opp_data:

        # For each opportunity, extract its target variable, save it in a separate list, and then remove it
        target = opportunity[-1] # Only last column
        targets_array.append(target)
        del opportunity[-1] # Remove last column

        X.append(opportunity)

    print(" Starting to transform to a sparse matrix" + str(time.now()))
    X = csr_matrix(X)
    print("Finished transform to a sparse matrix " + str(time.now()))

    # The target variable of each impression
    targets_array = np.array(targets_array, dtype=int)
    print("targets_array" + str(time.now()))

    return X, targets_array

def main():

    print("START -----> " + str(time.now()))
    running_time = time.now()

    opps_data = readOppFromFile(inputfilename)

    features, target = processOpportunities(opps_data)

if __name__ == '__main__':

    """ ################### GLOBAL VARIABLES ############################ """
    inputfilename = 'C:/somefolder/trainingset.working.csv'

    """ ################### START PROGRAM ############################ """
    main()

Update: The dimensions of the matrix are not constant; they depend on the input file and may change on each run of the program. For a small sample of my input, see here.

What determines the bounds of your sparse matrix? Just the number of lines in the file? Also, can you share a link to a very small version of your giant file so that anyone can reproduce and test? - KobeJohn
The dimensions are set by the input file, but they may change on each run. See a sample version of my input file here: github.com/nancyya/Public/blob/master/… - Serendipity
Thanks. I'll see if I can work something out. I've always wanted to try something with sparse matrices in numpy. However, can you check that the data file works with the code above? I get ValueError: invalid literal for int() with base 10: 'da7f5cb5-2189-40cc-8a42-9fdedc29f925' - KobeJohn
Oh, that is because I omitted a function from the code here: before the "for opportunity in opp_data" loop, it keeps only the numeric values of each opportunity. - Serendipity

1 Answer

You can construct a sparse matrix directly, if you keep track of the nonzeros manually:

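import numpy as np
from scipy.sparse import coo_matrix

# opp_data and time (datetime.datetime) come from the question's code
X_data = []
X_row, X_col = [], []
targets_array = []
n_rows = n_cols = 0

for row_idx, opportunity in enumerate(opp_data):
    targets_array.append(int(opportunity[-1]))
    # a list comprehension (rather than a bare map) also works on Python 3
    row = np.array([int(x) for x in opportunity[:-1]])
    col_inds, = np.nonzero(row)  # column indices of the nonzero entries
    X_col.extend(col_inds)
    X_row.extend([row_idx]*len(col_inds))
    X_data.extend(row[col_inds])
    n_rows, n_cols = row_idx + 1, len(row)

print(" Starting to transform to a sparse matrix" + str(time.now()))
# pass the shape explicitly; otherwise it is inferred from the largest
# indices, and trailing all-zero rows or columns would be dropped
X = coo_matrix((X_data, (X_row, X_col)), shape=(n_rows, n_cols), dtype=int)
print("Finished transform to a sparse matrix " + str(time.now()))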

This constructs the matrix in COO format, which is easy to transform into whatever format you like:

X = X.tocsr()
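
To try the idea without the 8 GB file, here is a minimal self-contained sketch of the same COO approach in Python 3. The function name load_sparse_csv and the sample values are made up for illustration, and it assumes that every column is numeric, that the target is in the last column, and that empty fields and the literal 'null' stand for 0:

```python
import io

import numpy as np
from scipy.sparse import coo_matrix


def load_sparse_csv(f):
    """Build a CSR matrix and a target vector from a comma-separated file object.

    Hypothetical helper: assumes a header line, numeric columns only, and the
    target in the last column; '' and 'null' are treated as 0.
    """
    header = f.readline().rstrip().split(',')
    n_cols = len(header) - 1              # the last column is the target
    data, rows, cols, targets = [], [], [], []
    n_rows = 0
    for line in f:
        line = line.rstrip('\r\n')
        if not line:                      # skip blank lines
            continue
        values = [0 if x in ('', 'null') else int(x) for x in line.split(',')]
        targets.append(values[-1])
        for col, value in enumerate(values[:-1]):
            if value != 0:                # store only the nonzero entries
                data.append(value)
                rows.append(n_rows)
                cols.append(col)
        n_rows += 1
    # Explicit shape keeps trailing all-zero rows and columns
    X = coo_matrix((data, (rows, cols)), shape=(n_rows, n_cols), dtype=int)
    return X.tocsr(), np.array(targets, dtype=int), header


# Tiny in-memory stand-in for the real file (made-up values)
sample = io.StringIO(
    "f1,f2,f3,target\n"
    "0,5,,1\n"
    "null,0,0,0\n"
    "7,0,0,1\n"
)
X, y, header = load_sparse_csv(sample)
```

Only the nonzero entries are ever held in memory, so the dense list of lists from the question is never built. For the real file, the same function could be called as load_sparse_csv(open(inputfilename)).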