
I have a dataframe with over 20 million records. I created chunks so I can write small sets out to multiple CSV files. The code works and creates n CSV files of equal size.

This is the code I am using:

n = 14 # number of chunks

df = split(df_t3, factor(sort(rank(row.names(df_t3)) %% n))) # break into n (here 14) roughly equal pieces

lapply(names(df), function(x){
  write.csv(df[[x]], paste0(x, ".txt"), row.names = FALSE) # write each piece to its own file
})

I want to modify this so that each chunk captures the entire set of records for a given ID before the file breaks off.

For example, ID10 has 300 rows and ID20 has 500 rows. A file should capture all 300 of those records before jumping to the next chunk. I have more than 1 million IDs, so I cannot use the ID itself as the criterion for breaking into chunks.
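For reference, here is a tiny, made-up example of the shape of the data (the column names other than ID and all of the values are illustrative only):

# Miniature, hypothetical version of df_t3: each ID owns a block of rows,
# and all rows for a given ID must end up in the same output file.
df_t3 = data.frame(
  ID    = rep(c("ID10", "ID20", "ID30"), times = c(3, 5, 2)),
  value = rnorm(10)
)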

I am not sure if my request is really clear; happy to provide more detail. Thanks.


1 Answer


If your IDs are even close to balanced, I would put roughly the same number of IDs in each chunk. Something like this:

ids = unique(df_t3$ID)
n_chunks = 14

# Assign each unique ID to a chunk: seq() spreads values evenly from just above 0
# up to n_chunks, and ceiling() turns them into the integers 1..n_chunks.
id_chunk = ceiling(seq(1e-5, n_chunks, length.out = length(ids)))

for(i in seq_len(n_chunks)) {
  # All rows whose ID was assigned to chunk i go into the same file,
  # so no ID is ever split across files.
  write.csv(
    df_t3[df_t3$ID %in% ids[id_chunk == i], ],
    file = paste0("chunk", i, ".csv"),
    row.names = FALSE
  )
}
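
If the row counts per ID are very uneven, a small variant of the same idea assigns whole IDs to chunks by cumulative row count, so the files come out closer to equal in rows while still never splitting an ID. This is only a sketch, assuming the same df_t3, ids and n_chunks as above:

rows_per_id = table(df_t3$ID)[as.character(ids)]   # row count for each ID, in the order of 'ids'
target_rows = ceiling(sum(rows_per_id) / n_chunks) # rough target of rows per file

# The running total of rows decides which chunk an ID falls into; pmin() caps
# the index at n_chunks so exactly n_chunks files are produced.
id_chunk = pmin(ceiling(cumsum(as.numeric(rows_per_id)) / target_rows), n_chunks)

The write.csv() loop above can then be reused unchanged, since it only depends on id_chunk.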