I am trying to set up a sparse matrix (dok_matrix) of journal co-occurrences. Unfortunately, my solution is too inefficient to be of any use, and I couldn't find a solution online.
EDIT: I would also like to create the sparse matrix directly, not by first creating a dense matrix and then turning it into a sparse matrix.
I start with a dataframe of how often certain journals are cited together. In the example below, Nature and Science are cited together 3 times. I would like to end up with a sparse, symmetric matrix where the rows and columns are journals and the non-empty entries are how often these journals are cited together. I.e., here the full matrix would have four rows (Lancet, Nature, NEJM, Science) and four columns (Lancet, Nature, NEJM, Science) and three distinct non-zero values (six non-zero entries once both symmetric halves are filled in). Since my real data is much larger, I would like to use a sparse matrix representation.
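Written out densely, purely for illustration (my real data is far too large for this), the target for the example would look roughly like the sketch below. The row/column order is an assumption: it is simply what Python's default string sort gives for these four titles, and the names `journals` and `expected` are only used here.

```python
import numpy as np

# Assumed ordering: Python's default sort puts 'NEJM' before 'Nature'
journals = ['Lancet', 'NEJM', 'Nature', 'Science']

# Dense picture of the desired symmetric co-citation matrix
expected = np.array([
    [0, 6, 0, 0],  # Lancet:  6 with NEJM
    [6, 0, 5, 0],  # NEJM:    6 with Lancet, 5 with Nature
    [0, 5, 0, 3],  # Nature:  5 with NEJM, 3 with Science
    [0, 0, 3, 0],  # Science: 3 with Nature
])
```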
What I currently do in my code is update the non-zero entries with the values from my DataFrame. Unfortunately, comparing journal names is quite time-consuming, and my question is whether there is a quicker way to set up the sparse matrix.
My understanding is that my dataframe is already close to a dok_matrix, with the journal combination being equivalent to the tuple used as a key in the dok_matrix. However, I do not know how to make this transformation.
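To illustrate what I mean by the key being a tuple: as far as I understand it, a dok_matrix stores its entries like a dictionary keyed by (row, column) index pairs. A minimal sketch, where the index assignment (Nature as 2, Science as 3) and the name `M` are assumptions made up for this illustration:

```python
from scipy.sparse import dok_matrix

# Small 4x4 matrix, one row/column per journal
M = dok_matrix((4, 4), dtype=int)

# Hypothetical index assignment: Nature -> 2, Science -> 3
M[2, 3] = 3

# The stored keys are (row, column) tuples, one per non-zero entry
stored_keys = list(M.keys())
print(stored_keys)
```

So each row of my dataframe would correspond to one such key, once the journal names are replaced by integer indices.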
Any help is appreciated!
# Import packages
import pandas as pd
from scipy.sparse import dok_matrix
# Set up dataframe
d = {
    'journal_comb': ['Nature//// Science', 'NEJM//// Nature', 'Lancet//// NEJM'],
    'no_combs': [3, 5, 6],
    'journal_1': ['Nature', 'NEJM', 'Lancet'],
    'journal_2': ['Science', 'Nature', 'NEJM'],
}
df = pd.DataFrame(d)
# Create list of all journal titles
journal_list = sorted(set(df['journal_1']) | set(df['journal_2']))
# Set up empty sparse matrix with final size
S = dok_matrix((len(journal_list), len(journal_list)))
# Loop over all journal titles and get the value from the DataFrame for co-occurring journals
# Update the sparse matrix value with the value from the DataFrame
for i in range(len(journal_list)):
    print(i)
    # Check whether the journal name is actually in column 'journal_1'
    if len(df[(df['journal_1'] == journal_list[i])]) > 0:
        for j in range(len(journal_list)):
            # If clause to circumvent an error due to an empty series if journals are not co-cited
            if len(df[(df['journal_1'] == journal_list[i]) & (df['journal_2'] == journal_list[j])]['no_combs']) == 1:
                # Update value in sparse matrix
                S[i, j] = df[(df['journal_1'] == journal_list[i]) & (df['journal_2'] == journal_list[j])]['no_combs'].iloc[0]