I have a data.frame with 93 million rows and 3 numeric variables. The third variable, "component", groups rows by an id. The data is the edge list of a huge graph, and the component number marks the rows that belong to the same connected component. There are about 83 million such components.

I am now trying to split the data frame into a list of 83 million data.frames, so that I can apply some igraph functions to each component.

This SO answer indicates that split() is the solution for this.

# library() attaches one package per call
library(dplyr)
library(data.table)
library(igraph)

# d6a: data.frame with edge A, edge B, component; 93 million rows, 83 million components, object.size = 2.4 GB
d6b <- d6a %>% split(f = d6a$component )
# This takes 7.1 hours to run and creates a 94.8 GB object

#Then try to run igraph on each element of the list
d6b %>% lapply(graph_from_data_frame,directed = TRUE) -> g6a
#code above ran for 20 hours without finishing

Is there a faster way to do this? Is there another structure that does not become so large?

EDIT: based on Gregor's comment below I changed the workflow:

# Selecting only the non-trivial components
# removing all 1:n or n:1 (including the 70 million 1:1)
d6a %>% group_by(component) %>% 
  mutate(N_edges=n(),
         N_cpf=n_distinct(cpf),
         N_pis=n_distinct(pis)) -> d6b #takes 1h
d6b_dt <- data.table(d6b) # takes 11min
d6b_dtf <- d6b_dt[N_cpf>1 & N_pis>1] # 5s
setkey(d6b_dtf, component) #1s

Then I tried to implement the suggestion:

d6b_dtf %>% group_by(component) %>% select(cpf,pis) %>% 
  do(graph_from_data_frame, directed = TRUE) -> g_d6b_dtf

I get the following error message:

Adding missing grouping variables: `component`
Error: Arguments to do() must either be all named or all unnamed
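The error comes from mixing an unnamed argument (graph_from_data_frame) with a named one (directed = TRUE) in the same do() call. Below is a minimal sketch of a call that passes a single named expression instead. It assumes a toy data frame, toy, standing in for d6b_dtf, with made-up character ids; the output column name g is also just an illustration. The columns are picked inside do() because select() on a grouped data frame adds component back as the first column, and graph_from_data_frame reads the first two columns as the edge endpoints.

```r
library(dplyr)
library(igraph)

# Toy stand-in for d6b_dtf (hypothetical ids, not the real data)
toy <- data.frame(
  cpf = c("c1", "c1", "c2", "c3", "c4"),
  pis = c("p1", "p2", "p1", "p3", "p3"),
  component = c(1, 1, 1, 2, 2),
  stringsAsFactors = FALSE
)

# One named argument to do(); `.` is the slice of rows for the current group
g_toy <- toy %>%
  group_by(component) %>%
  do(g = graph_from_data_frame(.[, c("cpf", "pis")], directed = TRUE))

# The list column holds one igraph object per component
g_list <- g_toy$g
```

This is only checked against the toy input; whether it scales to millions of groups is a separate question.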
This seems like a terrible idea. Almost all your little data frames will be 1 row. What are you going to graph with 1-row data frame? If you filter out the unduplicated components you'll be down to ~10M components or less which is much more reasonable. - Gregor Thomas
Also, as it appears you're using dplyr, why are you splitting at all? Just group_by(component) %>% do(graph_from_data_frame, directed = TRUE). And of course this will go much faster if your data is a keyed data.table. - Gregor Thomas
getting there, see edits above. Did not quite make it yet - LucasMation
Looking at examples in ?do will probably help. I don't know what the graph_from_data_frame function does - are you using it for side effects like saving a plot? Or is it returning a plot object you want to store in a list? Something else? - so it's hard to know exactly what do syntax is needed here. I'd still recommend converting to data table first - I'm sure the setkey will take some time, but it will probably be faster overall. Also don't forget that you can experiment with syntax on a tiny subset of your data, and when it's working try to scale it up. - Gregor Thomas
again thanks a bunch! What the igraph package does is create these "graph" objects (as in graph theory, not plots), which are data structures describing the relationships between vertices connected by edges. graph_from_data_frame takes an edge list (a data frame in which every row represents an edge of a graph) and converts it into a "graph" object. igraph includes other functions and methods, for instance a function to plot a graph. What I want is a list of "graph"s - LucasMation
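To make the description above concrete, here is a small sketch on a toy edge list (the ids are made up). It builds one graph from the whole edge list with graph_from_data_frame and then uses igraph's decompose() to get a list of per-component graphs, which is one possible way to end up with a list of "graph"s without splitting the data frame first. Note that min.vertices = 3 only drops the 1:1 components; it is not the same filter as N_cpf > 1 & N_pis > 1, and nothing here has been timed on the full 93 million rows.

```r
library(igraph)

# Toy edge list (hypothetical ids): each row is one edge from cpf to pis
edges <- data.frame(
  cpf = c("c1", "c1", "c2", "c3"),
  pis = c("p1", "p2", "p1", "p3"),
  stringsAsFactors = FALSE
)

# Build a single directed graph from the whole edge list
g <- graph_from_data_frame(edges, directed = TRUE)

# Split into weakly connected components, keeping only those with
# at least 3 vertices (this drops the two-vertex 1:1 components)
graphs <- decompose(g, mode = "weak", min.vertices = 3)

length(graphs)       # number of components kept
vcount(graphs[[1]])  # vertices in the first kept component
ecount(graphs[[1]])  # edges in the first kept component
```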