0
votes

I'm constructing large directional graphs (using igraph, from R) and have discovered a strange issue in which vertices are apparently duplicated for certain vertex names. This issue doesn't happen in small graphs, and the issue only appears to arise when the vertex names reach 1e+05. There is a clear regularity to the vertices that get duplicated. To jump ahead, the vertex duplication looks like this (generated in section 2 of the below code):

    name_vertex id_cell id_vertex
 1:      100000  100000     97355
 2:       1e+05  100000   1435205
 3:      200000  200000    197106
 4:       2e+05  200000   1435206
 5:      400000  400000    396605
 6:       4e+05  400000   1435207
 7:      500000  500000    496356
 8:       5e+05  500000   1435208
 9:      700000  700000    695855
10:       7e+05  700000   1435209
11:      800000  800000    795606
12:       8e+05  800000   1435210
13:     1000000 1000000    995105
14:       1e+06 1000000   1435211

The duplication occurs when 1e+05 is reached and then duplicates are generated for that and every subsequent vertex that is xe+0n where x is in 1:9 and n is >=5 (note that in this graph there is no 3e+05 valued vertex by construction- it lies on the matrix margin- and this is why it isn't present).

All x0.. versions of the vertices hold the outgoing edges, while the xe+0.. versions hold the incoming edges.


Reproducible example: (note: the way in which I generate the adjacency dataframe owes more to the pipeline I've been using to generate graphs for my use case. The issue could presumably be generated more directly).

The below code generates a matrix, identifies each cell's adjacencies and then constructs a graph from these. Cells at the matrix margin are assigned 0 values to remove them from the adjacency table (to prevent wrapping round the edges).

There are three sections:

(1) running for matrix dimension 100x100: correct behaviour

(2) running for matrix dimension 1200x1200: duplication

(3) unpacking the duplication issue

NOTE: it takes 30 seconds or so and 3-4GB RAM to produce the graph in (2)

# packages
library(data.table); library(igraph)

# function to get adjacent cells in a matrix
get_adjacent <- function(cells, n_row, n_col) {
    adjacencies_i <- c(cells-n_row - 1, 
                       cells-n_row, 
                       cells-n_row+1, 
                       cells-1, 
                       cells+1, 
                       cells+n_row-1, 
                       cells+n_row, 
                       cells+n_row+1)
    return(adjacencies_i)
}

# function to get the margins of a matrix (i.e. 1-deep outer margin of cells)    
get_margins <- function(matrix) {
    dims <- dim(matrix)
    bottom_right <- prod(dims)
    top_right <- (bottom_right - dims[1])
    c(1:dims[1], # first column 
      top_right:bottom_right, # last column
      seq(1, top_right, dims[1]), # top row
      seq(dims[1], bottom_right, dims[1])) # bottom row
}


# (1) Before creating the failure case, produce a much smaller graph that 
# has the correct behaviour

# generate a matrix of 1-valued cells
test_mat <- matrix(1, ncol=100, nrow=100)

# remove edge cells to prevent the adjacencies wrapping around the edges
test_mat[get_margins(test_mat)] <- 0

# plot: all black cells are those that should be represented in the graph, and 
# each of these cells should each be linked to their immediately adjacent neighbours 
# (including diagonals - see get_adjacent function)
image(test_mat, asp=1, col=c("red", "black"))

# calculate the adjacency dataframe to calculate a graph from
permitted_cells <- which(test_mat[] == 1)
n_row <- dim(test_mat)[1]
n_col <- dim(test_mat)[2]

# full set of adjacencies
adj <- data.table(from = rep(permitted_cells, (1*2 + 1)^2 - 1), 
                  to = get_adjacent(permitted_cells, n_row, n_col))
# remove those that are 0-valued
adj_permitted <- adj[to %in% permitted_cells,] 

# calculate graph
g <- graph_from_data_frame(adj_permitted[,list(from, to)], directed = T)

# get vertex names 
vertex_names <- names(V(g))
graph_vertices <- data.table(name_vertex = vertex_names, 
                             id_cell = as.integer(vertex_names), 
                             id_vertex = 1:length(vertex_names)) 
setorder(graph_vertices, id_cell)

# looks good: same number of vertices in graph as there are 1-valued cells in the 
# original matrix
print(paste0("n_vertices: ", nrow(graph_vertices)))
print(paste0("n_cells: ", sum(test_mat)))


## (2) failure case. Code is identical to the above, save for the dimensions of 
## the matrix being much larger (1200 rather than 100), and the image() function 
## is commented out. 

# generate a matrix of 1-valued cells
test_mat <- matrix(1, ncol=1200, nrow=1200)

# remove edge cells to prevent the adjacencies wrapping around the edges
test_mat[get_margins(test_mat)] <- 0

# plot: all black cells are those that should be represented in the graph, and 
# each of these cells should each be linked to their immediately adjacent neighbours 
# (including diagonals - see get_adjacent function)
# image(test_mat, asp=1, col=c("red", "black"))

# calculate the adjacency dataframe to calculate a graph from
permitted_cells <- which(test_mat[] == 1)
n_row <- dim(test_mat)[1]
n_col <- dim(test_mat)[2]

# full set of adjacencies
adj <- data.table(from = rep(permitted_cells, (1*2 + 1)^2 - 1), 
                  to = get_adjacent(permitted_cells, n_row, n_col))
# remove those that are 0-valued
adj_permitted <- adj[to %in% permitted_cells,] 

# calculate graph
g <- graph_from_data_frame(adj_permitted[,list(from, to)], directed = T)

# get vertex names 
vertex_names <- names(V(g))
graph_vertices <- data.table(name_vertex = vertex_names, 
                             id_cell = as.integer(vertex_names), 
                             id_vertex = 1:length(vertex_names)) 
setorder(graph_vertices, id_cell)

# there are 7 more vertices than there are 1-valued cells
print(paste0("n_vertices: ", nrow(graph_vertices)))
print(paste0("n_cells: ", sum(test_mat)))

print(paste0("n_extra_vertices: ", nrow(graph_vertices) - sum(test_mat)))

# (3) What are these extra vertices?
# get duplicated vertices
duplicated_vertices <- 
    graph_vertices[id_cell %in% graph_vertices[duplicated(id_cell),id_cell]]
setorder(duplicated_vertices, id_cell, id_vertex)

# the 7 additional vertices arise through duplication 
nrow(duplicated_vertices)
print(duplicated_vertices)

# xe+.. version has the incoming edges
incoming <- adjacent_vertices(g, duplicated_vertices$id_vertex, mode="in")
incoming[unlist(lapply(incoming, function(x) length(x) != 0))]
# x0.. version has outgoing edges
outgoing <- adjacent_vertices(g, duplicated_vertices$id_vertex, mode="out")
outgoing[unlist(lapply(outgoing, function(x) length(x) != 0))]

To (finally) get to the question. What is going on here? Is there something I can do to prevent this behaviour? The workaround I currently have is to take the incoming edges that are received by the xe+0.. version and add edges to these vertices for the x0.. version, before deleting the xe+0.. version of the vertex.

1

1 Answers

0
votes

The problem seems to be caused by R (or igraph) equating the two forms 100000 and 1e+05. I managed to resolve it by adding the statement options(scipen=99) at the start of the script, which stops R from using the e notation.