I'm trying to load this CSV file into a scipy sparse matrix representing the biadjacency matrix of a user-to-subreddit bipartite graph: http://figshare.com/articles/reddit_user_posting_behavior/874101
Here's a sample:
603,politics,trees,pics
604,Metal,AskReddit,tattoos,redditguild,WTF,cocktails,pics,funny,gaming,Fitness,mcservers,TeraOnline,GetMotivated,itookapicture,Paleo,trackers,Minecraft,gainit
605,politics,IAmA,AdviceAnimals,movies,smallbusiness,Republican,todayilearned,AskReddit,WTF,IWantOut,pics,funny,DIY,Frugal,relationships,atheism,Jeep,Music,grandrapids,reddit.com,videos,yoga,GetMotivated,bestof,ShitRedditSays,gifs,technology,aww
There are 876,961 lines (one per user), 15,122 distinct subreddits, and a total of 8,495,597 user-to-subreddit associations.
Here's the code I have right now, which takes 20 minutes to run on my MacBook Pro:
import numpy as np
from scipy.sparse import csr_matrix

row_list = []
entry_count = 0
all_reddits = set()
with open("reddit_user_posting_behavior.csv", 'r') as f:
    for x in f:
        pieces = x.rstrip().split(",")
        user = pieces[0]
        reddits = pieces[1:]  # every field after the user id is a subreddit name
        entry_count += len(reddits)
        for r in reddits:
            all_reddits.add(r)
        row_list.append(np.array(reddits))

reddits_list = np.array(list(all_reddits))

# 5s to get this far

rows = np.zeros((entry_count,))
cols = np.zeros((entry_count,))
data = np.ones((entry_count,))

i = 0
user_idx = 0
for row in row_list:
    # for each user, find the indices of their subreddits in reddits_list
    for reddit_idx in np.nonzero(np.in1d(reddits_list, row))[0]:
        cols[i] = user_idx
        rows[i] = reddit_idx
        i += 1
    user_idx += 1

# subreddits as rows, users as columns
adj = csr_matrix((data, (rows, cols)), shape=(len(reddits_list), len(row_list)))
It seems hard to believe that this is as fast as it can go... Loading the 82MB file into a list of lists takes 5s, but building the sparse matrix takes 200 times that. What can I do to speed this up? Is there some file format I could convert this CSV into in under 20 minutes that would import more quickly? Is there some obviously expensive operation I'm doing here? I've tried building a dense matrix, and I've tried creating a lil_matrix and a dok_matrix and assigning the 1's one at a time; neither is any faster.
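For reference, the one-at-a-time attempt looked roughly like this (a minimal sketch reusing row_list and reddits_list from above; the reddit_index dict is just an illustrative name-to-row lookup, not part of the code above):

from scipy.sparse import lil_matrix

# lookup table: subreddit name -> row index in reddits_list
reddit_index = {name: idx for idx, name in enumerate(reddits_list)}

adj = lil_matrix((len(reddits_list), len(row_list)))
for user_idx, row in enumerate(row_list):
    for r in row:
        adj[reddit_index[r], user_idx] = 1  # one assignment per association

The dok_matrix version was the same with dok_matrix swapped in for lil_matrix.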