I'm constructing large directional graphs (using igraph, from R) and have discovered a strange issue in which vertices are apparently duplicated for certain vertex names. This issue doesn't happen in small graphs, and the issue only appears to arise when the vertex names reach 1e+05. There is a clear regularity to the vertices that get duplicated. To jump ahead, the vertex duplication looks like this (generated in section 2 of the below code):
name_vertex id_cell id_vertex
1: 100000 100000 97355
2: 1e+05 100000 1435205
3: 200000 200000 197106
4: 2e+05 200000 1435206
5: 400000 400000 396605
6: 4e+05 400000 1435207
7: 500000 500000 496356
8: 5e+05 500000 1435208
9: 700000 700000 695855
10: 7e+05 700000 1435209
11: 800000 800000 795606
12: 8e+05 800000 1435210
13: 1000000 1000000 995105
14: 1e+06 1000000 1435211
The duplication occurs when 1e+05 is reached and then duplicates are generated for that and every subsequent vertex that is xe+0n where x is in 1:9 and n is >=5 (note that in this graph there is no 3e+05 valued vertex by construction- it lies on the matrix margin- and this is why it isn't present).
All x0.. versions of the vertices hold the outgoing edges, while the xe+0.. versions hold the incoming edges.
Reproducible example: (note: the way in which I generate the adjacency dataframe owes more to the pipeline I've been using to generate graphs for my use case. The issue could presumably be generated more directly).
The below code generates a matrix, identifies each cell's adjacencies and then constructs a graph from these. Cells at the matrix margin are assigned 0 values to remove them from the adjacency table (to prevent wrapping round the edges).
There are three sections:
(1) running for matrix dimension 100x100: correct behaviour
(2) running for matrix dimension 1200x1200: duplication
(3) unpacking the duplication issue
NOTE: it takes 30 seconds or so and 3-4GB RAM to produce the graph in (2)
# packages
library(data.table); library(igraph)
# function to get adjacent cells in a matrix
get_adjacent <- function(cells, n_row, n_col) {
adjacencies_i <- c(cells-n_row - 1,
cells-n_row,
cells-n_row+1,
cells-1,
cells+1,
cells+n_row-1,
cells+n_row,
cells+n_row+1)
return(adjacencies_i)
}
# function to get the margins of a matrix (i.e. 1-deep outer margin of cells)
get_margins <- function(matrix) {
dims <- dim(matrix)
bottom_right <- prod(dims)
top_right <- (bottom_right - dims[1])
c(1:dims[1], # first column
top_right:bottom_right, # last column
seq(1, top_right, dims[1]), # top row
seq(dims[1], bottom_right, dims[1])) # bottom row
}
# (1) Before creating the failure case, produce a much smaller graph that
# has the correct behaviour
# generate a matrix of 1-valued cells
test_mat <- matrix(1, ncol=100, nrow=100)
# remove edge cells to prevent the adjacencies wrapping around the edges
test_mat[get_margins(test_mat)] <- 0
# plot: all black cells are those that should be represented in the graph, and
# each of these cells should each be linked to their immediately adjacent neighbours
# (including diagonals - see get_adjacent function)
image(test_mat, asp=1, col=c("red", "black"))
# calculate the adjacency dataframe to calculate a graph from
permitted_cells <- which(test_mat[] == 1)
n_row <- dim(test_mat)[1]
n_col <- dim(test_mat)[2]
# full set of adjacencies
adj <- data.table(from = rep(permitted_cells, (1*2 + 1)^2 - 1),
to = get_adjacent(permitted_cells, n_row, n_col))
# remove those that are 0-valued
adj_permitted <- adj[to %in% permitted_cells,]
# calculate graph
g <- graph_from_data_frame(adj_permitted[,list(from, to)], directed = T)
# get vertex names
vertex_names <- names(V(g))
graph_vertices <- data.table(name_vertex = vertex_names,
id_cell = as.integer(vertex_names),
id_vertex = 1:length(vertex_names))
setorder(graph_vertices, id_cell)
# looks good: same number of vertices in graph as there are 1-valued cells in the
# original matrix
print(paste0("n_vertices: ", nrow(graph_vertices)))
print(paste0("n_cells: ", sum(test_mat)))
## (2) failure case. Code is identical to the above, save for the dimensions of
## the matrix being much larger (1200 rather than 100), and the image() function
## is commented out.
# generate a matrix of 1-valued cells
test_mat <- matrix(1, ncol=1200, nrow=1200)
# remove edge cells to prevent the adjacencies wrapping around the edges
test_mat[get_margins(test_mat)] <- 0
# plot: all black cells are those that should be represented in the graph, and
# each of these cells should each be linked to their immediately adjacent neighbours
# (including diagonals - see get_adjacent function)
# image(test_mat, asp=1, col=c("red", "black"))
# calculate the adjacency dataframe to calculate a graph from
permitted_cells <- which(test_mat[] == 1)
n_row <- dim(test_mat)[1]
n_col <- dim(test_mat)[2]
# full set of adjacencies
adj <- data.table(from = rep(permitted_cells, (1*2 + 1)^2 - 1),
to = get_adjacent(permitted_cells, n_row, n_col))
# remove those that are 0-valued
adj_permitted <- adj[to %in% permitted_cells,]
# calculate graph
g <- graph_from_data_frame(adj_permitted[,list(from, to)], directed = T)
# get vertex names
vertex_names <- names(V(g))
graph_vertices <- data.table(name_vertex = vertex_names,
id_cell = as.integer(vertex_names),
id_vertex = 1:length(vertex_names))
setorder(graph_vertices, id_cell)
# there are 7 more vertices than there are 1-valued cells
print(paste0("n_vertices: ", nrow(graph_vertices)))
print(paste0("n_cells: ", sum(test_mat)))
print(paste0("n_extra_vertices: ", nrow(graph_vertices) - sum(test_mat)))
# (3) What are these extra vertices?
# get duplicated vertices
duplicated_vertices <-
graph_vertices[id_cell %in% graph_vertices[duplicated(id_cell),id_cell]]
setorder(duplicated_vertices, id_cell, id_vertex)
# the 7 additional vertices arise through duplication
nrow(duplicated_vertices)
print(duplicated_vertices)
# xe+.. version has the incoming edges
incoming <- adjacent_vertices(g, duplicated_vertices$id_vertex, mode="in")
incoming[unlist(lapply(incoming, function(x) length(x) != 0))]
# x0.. version has outgoing edges
outgoing <- adjacent_vertices(g, duplicated_vertices$id_vertex, mode="out")
outgoing[unlist(lapply(outgoing, function(x) length(x) != 0))]
To (finally) get to the question. What is going on here? Is there something I can do to prevent this behaviour? The workaround I currently have is to take the incoming edges that are received by the xe+0.. version and add edges to these vertices for the x0.. version, before deleting the xe+0.. version of the vertex.