My answer is based on the igraph
package in R. The situation is indeed quite confusing, and the questions are relevant since, as Newman (2004) says,
Since the publication of that work, the author has been asked a number
of times whether an appropriate generalization of the algorithm exists
for weighted networks.
In his paper he derives an appropriate generalization of the Newman-Girvan algorithm to weighted networks.
Weights
You are right about the interpretation of weights in the Newman-Girvan algorithm. edge_betweenness
uses a formula analogous to that of Brandes (2001), where the length of a path is defined as the sum of the weights of its edges. (You may also check the source code, but it is quite involved.) In ?edge_betweenness
and, in particular, ?cluster_edge_betweenness
it says
Edge weights are used to calculate weighted edge betweenness. This
means that edges are interpreted as distances, not as connection
strengths.
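To see concretely that weights act as distances, here is a minimal sketch (the toy graph and weights are made up purely for illustration):

    library(igraph)

    # The direct edge A-B is "long" (weight 10), so the shortest path
    # detours through C; the path length is the sum of the edge weights.
    g <- graph_from_literal(A - B, A - C, C - B)
    E(g)$weight <- c(10, 1, 1)           # edges in order: A-B, A-C, C-B
    shortest_paths(g, "A", "B")$vpath    # A -> C -> B, avoiding the heavy edge
    distances(g, "A", "B")               # 2 = 1 + 1, the sum of weights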
The implications are as follows. Let b(e, w) be the edge betweenness of an edge e with weight w. Then it can be shown (I could elaborate if you wish) that
w >= w* implies b(e, w) <= b(e, w*).
That is, the edge betweenness of e is (weakly) decreasing in its weight. The intuition: if the weight of e is raised from w to w* >> w, the shortest paths that used to cross e are likely to be dominated by alternative paths that avoid e. Hence a larger weight implies a (weakly) lower betweenness, and a lower betweenness makes it less likely that e will be recognized as an edge connecting two communities. This is strange if we see weights as distances. Conversely, if e lies within some community and we decrease its weight, the number of shortest paths through e potentially increases, and it becomes more likely to be seen as connecting two communities. I am not yet claiming anything about the corresponding modularity scores, though.
Now suppose that the weights actually correspond to connection strengths. Then the stronger a connection is, the fewer shortest paths go through that edge (since the algorithm still treats the weights as lengths when computing shortest paths), the lower its edge betweenness, and the less likely it is to be removed. That kind of makes sense.
What is strange, though, is that the length of a path is then defined as the sum of the connection strengths along it. We can, however, reinterpret the algorithm. Suppose the weights are >> 1 within communities and << 1 between them. Then the length of a path can be read as its privacy: a path within a community contains many close interactions, while an edge connecting two communities is comparatively public and open. Under this interpretation, the algorithm looks for the least private (most open) paths, computes the corresponding betweenness, and removes the edges that belong to many of the most open paths.
So perhaps I made a mistake somewhere, but it looks like it would make more sense to see weights as connection strengths.
Newman (2004) does something related:
...we will consider specifically those networks in which the weights
on edges take greater values for vertex pairs that have closer
connections or are more similar in some way.
It would seem that this should make sense. However, to keep the more natural definition of the shortest path, he writes:
One can define paths on a weighted network by assuming the “length” of
an edge to vary inversely with its weight, so that two vertices that
are connected twice as strongly will be half as far apart.
That is, the shortest path lengths are now inversely related to the weights. Since not applying this inversion seemed to give sensible results above, we now have a problem:
To see this, notice that any two vertices that are particularly
strongly connected to one another will have a particularly short
distance along the edge between them. Geodesic paths will thus, all
other things being equal, prefer to flow along such an edge than along
another longer edge between two less well connected vertices, and
hence closely connected pairs will tend to attract a lot of paths and
acquire high betweenness. This means that, as a general rule, we are
more likely to remove edges between well connected pairs than we are
between poorly connected pairs, and this is the precise opposite of
what we would like the algorithm to do.
This is exactly the outcome I described above when weights are seen as distances. As I mentioned at the beginning of the answer, to deal with this Newman (2004) proposes mapping the weighted graph to an unweighted multigraph and then proceeding very similarly to the standard case. I believe this multigraph idea can be implemented by setting weighted = NULL
while supplying a non-binary (integer) adjacency matrix when defining the graph; see weighted
in ?graph_from_adjacency_matrix
.
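A minimal sketch of that mapping (the integer matrix is made up for illustration):

    library(igraph)

    # With weighted = NULL, an integer entry k in the adjacency matrix
    # creates k parallel edges, which is Newman's (2004) mapping from a
    # weighted graph (with integer weights) to an unweighted multigraph.
    A <- matrix(c(0, 3, 1,
                  3, 0, 1,
                  1, 1, 0), nrow = 3, byrow = TRUE)
    g <- graph_from_adjacency_matrix(A, mode = "undirected", weighted = NULL)
    ecount(g)  # 5 edges: three parallel ones between vertices 1 and 2

For non-integer strengths, an alternative (my suggestion, not Newman's) would be to pass reciprocal weights directly, e.g. cluster_edge_betweenness(g, weights = 1 / E(g)$weight), so that stronger connections correspond to shorter distances.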
Modularity
First of all, one can use modularity with weighted graphs, as Newman (2004) does; that is not a problem. In general, though, it is not obvious how the weights affect modularity as the criterion for choosing the number of communities. I will perhaps elaborate with more examples in R. As Newman (2004) finds, there should be an improvement over the unweighted case when the interpretation of the weights is in line with the way the algorithm works. Otherwise, I think the graph structure and the weights themselves can matter a great deal for how far from the truth we end up.
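As a starting point, here is a minimal sketch of such an experiment (the karate club graph is built into igraph; the weights are randomly generated, purely for illustration):

    library(igraph)

    set.seed(1)
    g <- make_graph("Zachary")                   # Zachary's karate club
    E(g)$weight <- sample(1:5, ecount(g), replace = TRUE)  # made-up weights

    cl <- cluster_edge_betweenness(g)            # uses E(g)$weight by default
    modularity(g, membership(cl), weights = E(g)$weight)        # weighted Q
    modularity(g, membership(cl), weights = rep(1, ecount(g)))  # unweighted Q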
References
Newman, M.E., 2004. Analysis of weighted networks. Physical Review E, 70(5), p.056131.
Brandes, U., 2001. A faster algorithm for betweenness centrality. Journal of Mathematical Sociology, 25(2), pp.163-177.