0
votes

I have a unweighted graph with 3 million vertices. I want to find average of shortest path for each vertex.

I have tried using igraph for a sample graph of 3000 nodes with the following code:

 N <- gorder(G)
 spathAvg <- lapply(V(G),
             function(v){
              q <-   shortest.paths(G, v )
              rowSums(q*is.finite(q),na.rm = TRUE)/N

            })

and it works fine. However, calculating average shortest path for 1000 of 3 million vertices take around 10 minutes. Calculating for all 3 million vertices would take a lot of time. I need help in calcualting average shortest path for each vertex of 3 million vertices in a fast and efficient way.

1

1 Answers

1
votes

It might be faster to call average.path.length(G). Docs here: http://cneurocvs.rmki.kfki.hu/igraph/doc/R/shortest.paths.html

EDIT: Sorry about that confusion. I think there are two general strategies you can use here:

  1. Parallelize this.
  2. Calculate shortest paths against a sampling of nodes instead of against the entire graph and estimate the average shortest path from as the mean of these paths.

I experimented with this a bit just on my laptop and observed a significant improvement in performance from parallelization and a smaller improvement from subsampling (70% followed by an additional 10%, to a total of around 72% improvement on a 5000 node barabasi graph using my laptop). Additionally, the average % sampling error falls quickly and there isn't much change between 5% and 10% sampling (this may be a consequence of my choice of barabasi graph for this experiment). If you have access to an HPC cluster, either or both of these strategies should be well suited to being mapped out to separate workers.

Here's your code modified to use parallelization and downsampling:

library(igraph)
library(foreach)
library(doParallel)

stopCluster(cl)
cl <- makeCluster(8)
registerDoParallel(cl)

sampled_est_par <- function(G, perc_subsample=.1, max_samples=1e3){
  N <- gorder(G)
  m <- floor(perc_subsample*N)
  m <- ifelse(m>max_samples, max_samples, m)
  foreach(v=1:N) %dopar% {
    q <-   igraph::shortest.paths(G, v, sample(N, m))
    rowSums(q*is.finite(q),na.rm = TRUE)/m
  }
}

And here's my experimentation code: https://gist.github.com/dmarx/80b4d093bdcab2fff97ee0da2968084f