75
votes

I need advice for rendering an undirected graph with 178,000 nodes and 500,000 edges. I've tried Neato, Tulip, and Cytoscape. Neato doesn't even come remotely close, and Tulip and Cytoscape claim they can handle it but don't seem to be able to. (Tulip does nothing and Cytoscape claims to be working, and then just stops.)

I'd just like a vector-format file (PS or PDF) with a remotely reasonable layout of the nodes.

Draw a small square and color it all black. :-) Sorry, I couldn't resist. – tvanfosson
What kind of data does this graph represent? Maybe you can simplify it automatically? It's just my guess: I have no info on the data represented, so it's hard to tell. Anyway, so many nodes and edges will not be very expressive on one piece of paper... – avp
Just how big a PDF are you expecting - something tiled over several sheets of A3? – Andy Dent
@Andy Dent, or several hundred... – Tom Neyland
You probably need to increase Cytoscape's heap: cytoscape.wodaklab.org/wiki/… – Ron

17 Answers

29
votes

Graphviz itself provides a solution for rendering large graphs.

Namely, Graphviz includes sfdp, a multiscale version of fdp (also part of Graphviz, similar to neato) for laying out large undirected graphs. It has been useful for drawing large graphs (70k nodes, 500k edges) in my project.

You can find documentation for it on the Graphviz website itself at http://www.graphviz.org/

More information, a paper describing the underlying techniques, and examples can be found here: http://yifanhu.net/PUB/graph_draw_small.pdf
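For illustration, a minimal way to drive sfdp from Python's standard library, going straight to the vector output the question asks for (the file names are placeholders, and sfdp must be on your PATH; -Goverlap=prism is Graphviz's overlap-removal pass, often recommended with sfdp):

import subprocess

# run Graphviz's sfdp on a DOT file and emit a PDF
subprocess.run(
    ["sfdp",
     "-Goverlap=prism",   # remove node overlaps after the multiscale layout
     "-Tpdf",             # vector output, as the question asks
     "-o", "large_graph.pdf",
     "graph.dot"],        # placeholder input file
    check=True,
)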

21
votes

I suggest that you first do some preprocessing of the data, for example collapsing nodes into clusters and then visualizing the clusters (see the sketch at the end of this answer). Collapsing reduces the number of nodes and makes it easier for algorithms such as Kamada-Kawai or Fruchterman-Reingold to render the resulting graph.

If you really need to visualize 500,000 nodes, then consider using a simple circular layout. It will be easy to render and avoids the issues that force-based algorithms have. Take a look at Circos: http://mkweb.bcgsc.ca/circos/

Circos is a graph visualization tool developed by bioinformatics people, tailored to visualizing genomes and other extremely large and complex datasets.

It's a Perl-based package; I hope that's not problematic.
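As a rough illustration of both suggestions, here is a minimal sketch using networkx and matplotlib (my choice of libraries; the random graph is a stand-in for the real data):

import networkx as nx
import matplotlib.pyplot as plt
from networkx.algorithms import community

g = nx.gnm_random_graph(2000, 10000)  # stand-in for the real data

# collapse nodes into clusters, then lay out the much smaller quotient graph
clusters = community.greedy_modularity_communities(g)
collapsed = nx.quotient_graph(g, [set(c) for c in clusters])
pos = nx.kamada_kawai_layout(collapsed)

# alternatively, a circular layout of the full graph needs no force simulation:
# pos = nx.circular_layout(g)

nx.draw(collapsed, pos, node_size=30, width=0.3)
plt.savefig("collapsed.pdf")  # vector output, as the question asks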

20
votes

I've had good results using the graph-tool library in Python. The graph below has 1,490 nodes and 19,090 edges; it took around five minutes to render on my laptop.

[image: political blogging network]

The graph data comes from the political blogging network described by Adamic and Glance in "The Political Blogosphere and the 2004 U.S. Election" (PDF: http://www2.scedu.unibo.it/roversi/SocioNet/AdamicGlanceBlogWWW.pdf). If you zoom in you can see the blog URLs for each node.

[image: zoomed view]

Here's the code I used to draw it (blog post: http://ryancompton.net/2014/10/22/stochastic-block-model-based-edge-bundles-in-graph-tool/):

import graph_tool.all as gt
import math

g = gt.collection.data["polblogs"] #  http://www2.scedu.unibo.it/roversi/SocioNet/AdamicGlanceBlogWWW.pdf
print(g.num_vertices(), g.num_edges())

#keep only nodes with at least one incoming and one outgoing edge
g = gt.GraphView(g,vfilt=lambda v: (v.out_degree() > 0) and (v.in_degree() > 0) )
g.purge_vertices()

print(g.num_vertices(), g.num_edges())

#node values: 1 -> Republican (red), 0 -> Democrat (blue)
red_blue_map = {1:(1,0,0,1), 0:(0,0,1,1)}
plot_color = g.new_vertex_property('vector<double>')
g.vertex_properties['plot_color'] = plot_color
for v in g.vertices():
    plot_color[v] = red_blue_map[g.vertex_properties['value'][v]]

#edge colors
alpha=0.15
edge_color = g.new_edge_property('vector<double>')
g.edge_properties['edge_color']=edge_color
for e in g.edges():
    if plot_color[e.source()] != plot_color[e.target()]:
        if plot_color[e.source()] == (0,0,1,1):
            #orange on dem -> rep
            edge_color[e] = (255.0/255.0, 102/255.0, 0/255.0, alpha)
        else:
            #purple on rep -> dem
            edge_color[e] = (102.0/255.0, 51/255.0, 153/255.0, alpha)
    #red on rep-rep edges
    elif plot_color[e.source()] == (1,0,0,1):
        edge_color[e] = (1,0,0, alpha)
    #blue on dem-dem edges
    else:
        edge_color[e] = (0,0,1, alpha)

state = gt.minimize_nested_blockmodel_dl(g, deg_corr=True)
bstack = state.get_bstack()
t = gt.get_hierarchy_tree(bstack)[0]
tpos = gt.radial_tree_layout(t, t.vertex(t.num_vertices() - 1), weighted=True)
cts = gt.get_hierarchy_control_points(g, t, tpos)
pos = g.own_property(tpos)
b = bstack[0].vp["b"]

#labels
text_rot = g.new_vertex_property('double')
g.vertex_properties['text_rot'] = text_rot
for v in g.vertices():
    if pos[v][0] >0:
        text_rot[v] = math.atan(pos[v][1]/pos[v][0])
    else:
        text_rot[v] = math.pi + math.atan(pos[v][1]/pos[v][0])

gt.graph_draw(g, pos=pos, vertex_fill_color=g.vertex_properties['plot_color'], 
            vertex_color=g.vertex_properties['plot_color'],
            edge_control_points=cts,
            vertex_size=10,
            vertex_text=g.vertex_properties['label'],
            vertex_text_rotation=g.vertex_properties['text_rot'],
            vertex_text_position=1,
            vertex_font_size=9,
            edge_color=g.edge_properties['edge_color'],
            vertex_anchor=0,
            bg_color=[0,0,0,1],
            output_size=[4024,4024],
            output='polblogs_blockmodel.png')

12
votes

Try Gephi; it has a new layout plugin called OpenOrd that scales to millions of nodes.

4
votes

Mathematica could very likely handle it, but I have to admit my first reaction was along the lines of the comment that said "take a piece of paper and color it black." Is there no way to reduce the density of the graph?

A possible issue is that you seem to be looking for layout, not just rendering. I don't know the big-O characteristics of the layout algorithms implemented by the various tools, but intuitively I would guess that laying out that much data could take a very long time.

3
votes

Does it need to be truly accurate?

Depending on what you're trying to accomplish, it might be good enough to graph just 10% or 1% of the data volume. (Of course, it might also be completely useless; it all depends on what the visualization is for.)
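A minimal sketch of that sampling idea, with networkx standing in for whatever the data actually lives in:

import random
import networkx as nx

g = nx.gnm_random_graph(178_000, 500_000)  # stand-in for the real graph

# keep a random 10% of the edges; nodes that lose all their edges drop out
kept = random.sample(list(g.edges()), k=g.number_of_edges() // 10)
sample = nx.Graph(kept)  # only nodes touched by a kept edge are included
print(sample.number_of_nodes(), sample.number_of_edges())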

3
votes

BioFabric (www.BioFabric.org) is another tool for visualizing large graphs. It should be able to handle the network described (178,000 nodes and 500,000 edges) OK, though the initial layout may take a while. The network shown here (from the Stanford Large Network Dataset Collection) is the Stanford web graph, which has 281,903 nodes and 2,312,497 edges:

[image: Stanford Web Network]

BioFabric's scalability is due to the fact that it represents nodes not as points but as horizontal lines; the edges are then shown as vertical lines. For some intuition about how this works, there is the Super-Quick BioFabric Demo, which animates a small network using D3.
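To give some intuition in code, here is a toy matplotlib rendering of that node-as-line idea. This is my own sketch of the concept, not BioFabric's actual layout algorithm, which orders nodes and edges far more carefully:

import matplotlib.pyplot as plt
import networkx as nx

g = nx.karate_club_graph()  # small stand-in graph

# one row per node, highest degree first
row = {v: i for i, v in enumerate(sorted(g, key=g.degree, reverse=True))}
# one column per edge, ordered by the edge's topmost endpoint
edges = sorted((tuple(sorted(e, key=row.get)) for e in g.edges()),
               key=lambda e: row[e[0]])
col = {e: j for j, e in enumerate(edges)}

fig, ax = plt.subplots(figsize=(8, 5))
for (u, v), j in col.items():
    ax.plot([j, j], [row[u], row[v]], lw=0.6)  # edge = vertical segment
for v, i in row.items():
    js = [j for e, j in col.items() if v in e]
    if js:  # skip isolated nodes
        ax.plot([min(js), max(js)], [i, i], lw=1.2, color="black")  # node = horizontal segment
ax.invert_yaxis()  # row 0 at the top
ax.axis("off")
fig.savefig("fabric_sketch.png", dpi=200)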

The primary application is written in Java. At the moment it can only export PNG images, not PDFs. There is a PDF export option in RBioFabric, though that is a very simple implementation that cannot yet handle really large networks.

Full disclosure: BioFabric is a tool that I wrote.

2
votes

I expect edge bundling (http://www.visualcomplexity.com/vc/project_details.cfm?id=679&index=679&domain=) would help. This technique bundles related edges together, reducing the visual complexity of the graph. You may have to implement the algorithm yourself, though (see the sketch below for one off-the-shelf option).
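One off-the-shelf alternative to implementing it yourself is the hammer_bundle routine in the datashader library. This is my suggestion, not part of the linked project, and you should double-check the datashader docs in case the API has changed:

import pandas as pd
import networkx as nx
from datashader.bundling import hammer_bundle

g = nx.karate_club_graph()  # small stand-in graph
layout = nx.spring_layout(g, seed=42)

index = {v: i for i, v in enumerate(g)}
nodes = pd.DataFrame([layout[v] for v in g], columns=["x", "y"])
edges = pd.DataFrame([(index[u], index[v]) for u, v in g.edges()],
                     columns=["source", "target"])

bundled = hammer_bundle(nodes, edges)  # x/y points; NaN rows split the paths
bundled.plot(x="x", y="y", legend=False, figsize=(8, 8))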

1
votes

If all else fails, you might offer a sanitized version of the file to the developers of those tools as a debugging test case.

1
votes

The Large Graph Layout (LGL) project helped me a lot with a similar problem. It handles the layout and has a small Java app to draw the produced layouts in 2D. There is no vector output out of the box, so you'll have to draw the graph yourself given the node coordinates produced by LGL (a sketch of that follows).
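For example, a hedged matplotlib sketch of that last step, assuming LGL's 2D output is a whitespace-separated "node x y" file and the input was a plain "source target" edge list (both file names are placeholders; check your LGL version's docs for the exact formats):

import matplotlib.pyplot as plt

pos = {}
with open("coords.out") as f:  # hypothetical LGL coordinates file
    for line in f:
        name, x, y = line.split()
        pos[name] = (float(x), float(y))

with open("graph.ncol") as f:  # the edge list that was fed to LGL
    edges = [line.split()[:2] for line in f]

fig, ax = plt.subplots(figsize=(10, 10))
for u, v in edges:
    ax.plot([pos[u][0], pos[v][0]], [pos[u][1], pos[v][1]],
            lw=0.1, color="gray")
ax.scatter([x for x, _ in pos.values()], [y for _, y in pos.values()], s=1)
ax.axis("off")
fig.savefig("graph.pdf")  # vector output, as the question asks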

0
votes

A Windows tool that can visualize graphs is Pajek; it generates EPS output. However, I don't know whether it can read your data.

0
votes

There's a list of apps here: http://www.mkbergman.com/?p=414

Walrus and LGL are two tools supposedly suited to large graphs. However, both seem to require graphs to be input as text files in their own special formats, which might be a pain (though, as sketched below, an edge-list format is easy to generate).
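For what it's worth, LGL's ncol input is, as far as I know, just a whitespace-separated edge list (with an optional weight column), so generating it takes only a few lines; the edge data here is a placeholder:

edges = [("a", "b"), ("b", "c"), ("a", "c")]  # placeholder data

# one "source target" pair per line, the ncol style LGL reads
with open("graph.ncol", "w") as f:
    for u, v in edges:
        f.write(f"{u} {v}\n")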

0
votes

I don't think you can come remotely close to visualising that in a flat layout.

I've been intrigued for some time by hyperbolic graphs, described in this research paper. Try the software from SourceForge.

Another idea is just graphing the nodes using a treemap, as seen at Panopticode.

0
votes

You can also try NAViGaTOR (disclosure: I'm one of the developers of that software). We've successfully visualized graphs with as many as 1.7 million edges with it, although networks that large are hard to manipulate (the user interface gets laggy). However, it uses OpenGL for the visualization, so some of the overhead is offloaded to the graphics card.

Also note that you'll have to crank up the memory settings in the File->Preferences dialog box before you can successfully open a network that big.

Finally, as most of the other responses point out, you are better off re-organizing your data into something smaller and more meaningful.

0
votes

First, I would like to second aliekens' suggestion to try sfdp. It is the large-scale version of Neato.

As OJW suggests, you could also just plot the nodes in R^2. Your edges actually supply what he calls a "natural ordering": in particular, you can plot the components of the second and third eigenvectors of the normalized graph Laplacian. This is the matrix L in the Wikipedia page on spectral clustering. You should be able to write down this matrix without understanding the linear algebra behind it. Then you have reduced your problem to approximately computing the first few eigenvectors of a large sparse matrix, which is traditionally done with iterative methods and is implemented in standard linear algebra packages. This method should scale to very large graphs (see the sketch below).
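A minimal sketch of that recipe with scipy (my choice of tooling; the edge array is placeholder data):

import numpy as np
import scipy.sparse as sp
from scipy.sparse.csgraph import laplacian
from scipy.sparse.linalg import eigsh

edges = np.array([(0, 1), (1, 2), (2, 0), (2, 3)])  # placeholder edge list
n = int(edges.max()) + 1

# sparse adjacency matrix, symmetrized since the graph is undirected
a = sp.coo_matrix((np.ones(len(edges)), (edges[:, 0], edges[:, 1])),
                  shape=(n, n))
a = (a + a.T).tocsr()

lap = laplacian(a, normed=True)  # the normalized graph Laplacian L

# smallest eigenpairs; the first eigenvector is trivial, so take the next two
vals, vecs = eigsh(lap, k=3, which="SM")
xy = vecs[:, 1:3]  # one (x, y) position per node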