I'm trying to load this CSV file into a scipy sparse matrix representing the biadjacency matrix of a user-to-subreddit bipartite graph: http://figshare.com/articles/reddit_user_posting_behavior/874101
Here's a sample:
603,politics,trees,pics
604,Metal,AskReddit,tattoos,redditguild,WTF,cocktails,pics,funny,gaming,Fitness,mcservers,TeraOnline,GetMotivated,itookapicture,Paleo,trackers,Minecraft,gainit
605,politics,IAmA,AdviceAnimals,movies,smallbusiness,Republican,todayilearned,AskReddit,WTF,IWantOut,pics,funny,DIY,Frugal,relationships,atheism,Jeep,Music,grandrapids,reddit.com,videos,yoga,GetMotivated,bestof,ShitRedditSays,gifs,technology,aww
There are 876,961 lines (one per user), 15,122 distinct subreddits, and a total of 8,495,597 user-to-subreddit associations.
Here's the code I have right now, which takes 20 minutes to run on my MacBook Pro:
import numpy as np
from scipy.sparse import csr_matrix

row_list = []
entry_count = 0
all_reddits = set()
with open("reddit_user_posting_behavior.csv", 'r') as f:
    for x in f:
        pieces = x.rstrip().split(",")
        user = pieces[0]
        reddits = pieces[1:]  # every field after the user id is a subreddit name
        entry_count += len(reddits)
        for r in reddits:
            all_reddits.add(r)
        row_list.append(np.array(reddits))

reddits_list = np.array(list(all_reddits))

# 5s to get this far

rows = np.zeros((entry_count,))
cols = np.zeros((entry_count,))
data = np.ones((entry_count,))

i = 0
user_idx = 0
for row in row_list:
    # for each user, find the indices of their subreddits in reddits_list
    for reddit_idx in np.nonzero(np.in1d(reddits_list, row))[0]:
        cols[i] = user_idx
        rows[i] = reddit_idx
        i += 1
    user_idx += 1

# subreddits as rows, users as columns
adj = csr_matrix((data, (rows, cols)), shape=(len(reddits_list), len(row_list)))
It seems hard to believe that this is as fast as it can go... Loading the 82MB file into a list of lists takes 5s, but building the sparse matrix takes 200 times that. What can I do to speed this up? Is there some file format I could convert this CSV into in under 20 minutes that would import more quickly? Is there some obviously expensive operation I'm doing here? I've tried building a dense matrix, and I've tried creating a lil_matrix and a dok_matrix and assigning the 1's one at a time; neither is any faster.
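For reference, the one-at-a-time attempt looked roughly like this (a minimal sketch reusing row_list and reddits_list from above; the reddit_index dict is just an illustrative name-to-row lookup, not part of the code above):

from scipy.sparse import lil_matrix

# lookup table: subreddit name -> row index in reddits_list
reddit_index = {name: idx for idx, name in enumerate(reddits_list)}

adj = lil_matrix((len(reddits_list), len(row_list)))
for user_idx, row in enumerate(row_list):
    for r in row:
        adj[reddit_index[r], user_idx] = 1  # one assignment per association

The dok_matrix version was the same with dok_matrix swapped in for lil_matrix.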