5
votes

I have a CSV file the first row of which contains the variables names and the rest of the rows contains the data. What's a good way to break it up into files each containing just one variable in R? Is this solution going to be robust? E.g. what if the input file is 100G in size?

The input files looks like

var1,var2,var3
1,2,hello
2,5,yay
...

I want to create 3 (or however many variables) files var1.csv, var2.csv, var3.csv so that files resemble File1

var1
1
2
...

File2

var2?
2
5
...

File3

var3
hello
yay

I got a solution in Python (How to break a large CSV data file into individual data files?) but I wonder if R can do the same thing? Essential the Python code reads the csv file line by line and then writes the lines out one at a time. Can R do the same? The command read.csv reads the whole file all at once and this can slow the whole process down. Plus it can't read a 100G file and process it as R attempts to read the whole file into memory. I can't find a command in R that let's you read a csv file line by line. Please help. Thanks!!

1

1 Answers

7
votes

You can scan and then write to a file(s) one line at a time.

i <- 0
while({x <- scan("file.csv", sep = ",", skip = i, nlines = 1, what = "character");
       length(x) > 1}) {
  write(x[1], "file1.csv", sep = ",", append = T)
  write(x[2], "file2.csv", sep = ",", append = T)
  write(x[3], "file3.csv", sep = ",", append = T)
  i <- i + 1
}

edit!! I am using the above data, copied over 1000 times. I've done a comparison of speed when we have the file connection open at all times.

ver1 <- function() {
  i <- 0
  while({x <- scan("file.csv", sep = ",", skip = i, nlines = 1, what = "character");
         length(x) > 1}) {
    write(x[1], "file1.csv", sep = ",", append = T)
    write(x[2], "file2.csv", sep = ",", append = T)
    write(x[3], "file3.csv", sep = ",", append = T)
    i <- i + 1
  }
}

system.time(ver1()) # w/ close to 3K lines of data, 3 columns
##    user  system elapsed 
##   2.809   0.417   3.629 

ver2 <- function() {
  f <- file("file.csv", "r")
  f1 <- file("file1.csv", "w")
  f2 <- file("file2.csv", "w")
  f3 <- file("file3.csv", "w")
  while({x <- scan(f, sep = ",", skip = 0, nlines = 1, what = "character");
         length(x) > 1}) {
    write(x[1], file = f1, sep = ",", append = T, ncol = 1)
    write(x[2], file = f2, sep = ",", append = T, ncol = 1)
    write(x[3], file = f3, sep = ",", append = T, ncol = 1)
  } 
  closeAllConnections()
}

system.time(ver2())
##   user  system elapsed 
##   0.257   0.098   0.409