
I am trying to create a huge network using the igraph Python module. I am iterating through a list of dictionaries in the following format:

d1={'el1':2, 'el3':4, ...,'el12':32}
d2={'el3':5, 'el4':6, ...,'el12':21}

The network is created in the following way: every key of the dictionaries becomes a node, with an attribute equal to the sum of all the values for that key across the dictionaries (for example, 9 for el3 given the two dictionaries above), and there is an edge between two nodes if they appear together in the same dictionary, with a weight attribute equal to the number of dictionaries in which they appear together (for instance, 2 for el3 and el12, as they appear together in 2 dictionaries).
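
These rules can be checked with plain Python for the two example dictionaries (using concrete versions of d1 and d2, without the elided keys):

```python
d1 = {'el1': 2, 'el3': 4, 'el12': 32}
d2 = {'el3': 5, 'el4': 6, 'el12': 21}
dicts = [d1, d2]

# node attribute for 'el3': sum of its values across all dictionaries
el3_total = sum(d.get('el3', 0) for d in dicts)  # 4 + 5 = 9

# edge weight for ('el3', 'el12'): number of dictionaries containing both keys
weight = sum(1 for d in dicts if 'el3' in d and 'el12' in d)  # 2
```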

I am using the following loop to create the network, where 'item' is a dictionary like the one described above. To be clear, I have about 12,000 elements to analyze:

import itertools
import igraph as ig

g = ig.Graph()

count = 0
for el in news_db:
    item = dict(news_db.get(el))['commenters']
    print(count)
    count = count + 1
    for user in item:
        try:
            g.vs.find(user)['comment'] = g.vs.find(user)['comment'] + 1
        except ValueError:  # vertex does not exist yet
            g.add_vertex(user)
            g.vs.find(user)['comment'] = 1

    for source, target in itertools.combinations(item.keys(), 2):
        source_id = g.vs.find(source).index
        target_id = g.vs.find(target).index
        if g.are_connected(source_id, target_id):
            edge_id = g.get_eid(source_id, target_id)
            g.es[edge_id]['weight'] = g.es[edge_id]['weight'] + 1
        else:
            g.add_edge(source_id, target_id, weight=1)

The problem is that this procedure is really slow. It takes about 23 seconds to cycle through the first 25 elements, and the loop execution time gets worse as time passes. I have used a profiling tool and found that 97% of the time is spent in the 'add_edge' function. Am I using igraph at its best? Is there any way to lower this execution time?

To be clear, I also have an alternative networkx version, which takes about 3 minutes to create the graph. In that case the problem is that the procedure for saving the graph to disk takes too much memory and my laptop freezes. Besides, I thought that analyzing the graph with networkx would be really slow, given its pure Python implementation, so I decided to switch directly to igraph to solve both problems.


1 Answer


Look here for reasons why add_edge is so slow.

Moreover, it seems that you are doing things very inefficiently. It would be better to collect all the necessary data before instantiating the graph, instead of performing so many incremental updates. The collections.Counter class is made for exactly this:

import collections
import itertools

news_db = [{'el1':2, 'el3':4, 'el12':32},
           {'el3':5, 'el4':6, 'el12':21}]

vertices = collections.Counter()
edges = collections.Counter()

for item in news_db:
    vertices.update(**item)
    edges.update(itertools.combinations(item.keys(), 2))

print(vertices)
print(edges)

This outputs the desired vertex and edge counts:

Counter({'el12': 53, 'el3': 9, 'el4': 6, 'el1': 2})
Counter({('el3', 'el12'): 2, ('el3', 'el4'): 1, ('el3', 'el1'): 1, ('el1', 'el12'): 1, ('el12', 'el4'): 1})

and you can instantiate the graph using them.