I have a data.frame with 93 million rows and 3 numeric variables. The third variable, "component", groups rows by an id. The data is the edge list of a huge graph, and the component number indicates which rows belong to the same connected component. There are about 83 million such components.
I am now trying to split the data frame into a list of 83 million data.frames. I do this in order to apply some igraph functions to each component.
This SO answer indicates that split() is the solution for this.
# library() attaches one package per call
library(dplyr)
library(data.table)
library(igraph)
# d6a: data.frame with edge A, edge B, component; 93 million rows, 83 million components, object.size = 2.4 Gb
d6b <- d6a %>% split(f = d6a$component)
# This takes 7.1 hours to run and creates a 94.8 Gb object
# Then try to run igraph on each element of the list
d6b %>% lapply(graph_from_data_frame, directed = TRUE) -> g6a
# The code above ran for 20 hours without finishing
Is there a faster way to do this? Is there another structure that does not become so large?
EDIT: based on Gregor's comment below, I changed the workflow:
# Selecting only the non-trivial components,
# removing all 1:n or n:1 (including the 70 million 1:1)
d6a %>% group_by(component) %>%
  mutate(N_edges = n(),
         N_cpf = n_distinct(cpf),
         N_pis = n_distinct(pis)) -> d6b # takes 1h
d6b_dt <- data.table(d6b) # takes 11min
d6b_dtf <- d6b_dt[N_cpf>1 & N_pis>1] # 5s
setkey(d6b_dtf, component) #1s
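The counting and filtering steps above could probably also be done in data.table directly, which would avoid the dplyr pass and the later conversion. The following is only a minimal sketch on made-up toy data: the `edges` table and its values are hypothetical, and only the column names `cpf`, `pis` and `component` are taken from the code above. It has not been timed on data of the size described here.

```r
library(data.table)

# Hypothetical toy edge list; column names mirror the ones used above.
edges <- data.table(
  cpf       = c(1, 1, 2, 3, 4, 5),
  pis       = c(101, 102, 101, 103, 103, 105),
  component = c(1, 1, 1, 2, 2, 3)
)

# Add per-component counts by reference (no copy of the table).
edges[, `:=`(N_edges = .N,
             N_cpf   = uniqueN(cpf),
             N_pis   = uniqueN(pis)),
      by = component]

# Keep only components with more than one distinct cpf AND more than one distinct pis.
nontrivial <- edges[N_cpf > 1 & N_pis > 1]
setkey(nontrivial, component)
```

In this toy table only component 1 survives the filter, since component 2 is n:1 and component 3 is 1:1.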
Then I tried to implement the suggestion:
d6b_dtf %>% group_by(component) %>% select(cpf,pis) %>%
do(graph_from_data_frame, directed = TRUE) -> g_d6b_dtf
I get the following error message:
Adding missing grouping variables: `component`
Error: Arguments to do() must either be all named or all unnamed
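The error is raised because do() received one unnamed argument (the function) and one named argument (directed = TRUE). do() expects expressions in which `.` stands for the current group's rows, not a bare function. A minimal sketch of one way to write the call, on hypothetical toy data (the `edges` data frame and the result name `g` are assumptions for illustration):

```r
library(dplyr)
library(igraph)

# Hypothetical toy edge list; column names mirror the ones in the question.
edges <- data.frame(
  cpf       = c(1, 1, 2, 3, 4),
  pis       = c(101, 102, 101, 103, 103),
  component = c(1, 1, 1, 2, 2)
)

# Inside do(), `.` is the data for one group. Naming the expression (g = ...)
# stores each result in a list column called g.
# Only cpf and pis are passed on, so component does not become an edge attribute.
res <- edges %>%
  group_by(component) %>%
  do(g = graph_from_data_frame(as.data.frame(.)[, c("cpf", "pis")],
                               directed = TRUE))

# res$g is a list holding one igraph object per component.
n_vertices <- sapply(res$g, vcount)
n_edges    <- sapply(res$g, ecount)
```

Selecting the columns inside the do() expression also avoids the "Adding missing grouping variables" message, which comes from calling select() on grouped data. Whether this scales to millions of groups is a separate question, since it still creates one graph object per component.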
If you are using dplyr, why are you splitting at all? Just group_by(component) %>% do(graph_from_data_frame, directed = TRUE). And of course this will go much faster if your data is a keyed data.table. – Gregor Thomas
?do will probably help. I don't know what the graph_from_data_frame function does - are you using it for side effects like saving a plot? Or is it returning a plot object you want to store in a list? Something else? - so it's hard to know exactly what do syntax is needed here. I'd still recommend converting to data table first - I'm sure the setkey will take some time, but it will probably be faster overall. Also don't forget that you can experiment with syntax on a tiny subset of your data, and when it's working try to scale it up. – Gregor Thomas
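The advice about experimenting on a tiny subset can be sketched as follows. Everything here is illustrative: `edges` is a hypothetical stand-in for the full edge list and `n_test` is an arbitrary number of components to keep.

```r
# Hypothetical toy stand-in for the full edge list.
edges <- data.frame(
  cpf       = c(1, 1, 2, 3, 4, 5),
  pis       = c(101, 102, 101, 103, 103, 105),
  component = c(1, 1, 1, 2, 2, 3)
)

# Take the first few component ids and keep every row that belongs to them,
# so that each retained component is complete.
n_test <- 2
ids    <- head(unique(edges$component), n_test)
small  <- edges[edges$component %in% ids, ]
```

Subsetting by component id rather than by row keeps whole components together, so per-component code behaves on `small` as it would on the full data.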