What is the most efficient read csv function / package in R at importing a subset of csv columns

Question

I am trying to load a csv into R that is c. 180k rows and c.9k columns. I only need a subset of c.100 columns, and know the names of the columns that I need.

The answers for this question cover a lot of the different options for selecting a subset of columns when importing a csv.

Using those answers, and my own knowledge, I have tried to use read.csv and read.table from utils, fread from data.table, and read_csv from readr. All of these functions seem to import the whole csv and then subset the columns - which is very inefficient where I only need such a small subset of columns. I have also tried to use read.csv.sql from sqldf, which seemed promising as selecting specific columns is a very common SQL task, but I could not import a subset of columns because I got the error Error: too many columns on file.

Two contributions from the community that would really help me out:

Can anyone point to a different import function that would be a lot more efficient where I only need a subset of columns?
Can anyone give more context about how these functions work and correct my misunderstanding or reassure me that the solution I'm looking for doesn't exist?

Many thanks!

P.S. I haven't asked many questions on SO before so I'd appreciate feedback if I need to ask in a different way.

If you know the columns you want to read in, is pre-processing the file an option? What is your OS? — Rui Barradas
"All of these functions seem to import the whole csv and then subset the columns - which is very inefficient where I only need such a small subset of columns." This isn't true. My experience is fread is the fastest. — thc
Either data.table::fread or vroom::vroom should let you load specific columns without loading the whole file into memory. — Gregor Thomas
Thanks for the useful comments all. I'll try to do some benchmarking on the different options I've tried and other suggestions people have made to say something illustrative about efficiency and then update. — Keynes2.0

G. Grothendieck G. Grothendieck · Accepted Answer · 2020-11-05T19:37:34

There are many command line utilities (e.g. sed, awk, cut, csvfix, miller, csvkit, csvtk) that can do this. Below we use xsv. This removes the unwanted columns before the file ever reaches R. Use the full path to xsv if iot is not already on the PATH. xsv accepts column names as shown below or field numbers.

# write out test data
write.csv(iris, "iris-test.csv", quote = FALSE, row.names = FALSE)

cmd <- "xsv select Sepal.Length,Petal.Length-Species iris-test.csv"
DF <- read.csv(pipe(cmd))

head(DF)
##   Sepal.Length Petal.Length Petal.Width Species
## 1          5.1          1.4         0.2  setosa
## 2          4.9          1.4         0.2  setosa
## 3          4.7          1.3         0.2  setosa
## 4          4.6          1.5         0.2  setosa
## 5          5.0          1.4         0.2  setosa
## 6          5.4          1.7         0.4  setosa

Or with UNIX cut (also available in Windows Rtools in \Rtools40\usr\bin in R 4.0+) the following works. Use the full path to cut if cut is not already on your PATH.

cmd2 <- "cut -d, -f 1,3-5 iris-test.csv"
DF <- read.csv(pipe(cmd2))

What is the most efficient read csv function / package in R at importing a subset of csv columns

2 Answers