I am trying to load a CSV into R that is c. 180k rows and c. 9k columns. I only need a subset of c. 100 columns, and I know the names of the columns I need.
The answers to this question cover many of the options for selecting a subset of columns when importing a CSV.
Using those answers and my own knowledge, I have tried read.csv and read.table from utils, fread from data.table, and read_csv from readr. All of these functions seem to import the whole CSV and then subset the columns, which is very inefficient when I only need such a small subset. I have also tried read.csv.sql from sqldf, which seemed promising because selecting specific columns is a very common SQL task, but I could not import a subset of columns because I got the error "Error: too many columns on file".
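For reference, this is roughly how I have been calling each function (the file name and column names below are placeholders for the real ones):

    library(data.table)
    library(readr)
    library(sqldf)

    # placeholder names - the real file has ~9k columns, of which I need ~100
    wanted <- c("col_a", "col_b", "col_c")

    df1 <- read.csv("big.csv")[, wanted]    # utils
    dt  <- fread("big.csv")[, ..wanted]     # data.table
    df2 <- read_csv("big.csv")[, wanted]    # readr

    # sqldf - this is the call that fails with "Error: too many columns on file"
    df3 <- read.csv.sql("big.csv",
                        sql = "select col_a, col_b, col_c from file")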
Two contributions from the community that would really help me out:
- Can anyone point to a different import function that would be a lot more efficient when I only need a subset of columns?
- Can anyone give more context on how these functions work, and either correct any misunderstanding on my part or reassure me that the solution I'm looking for doesn't exist?
Many thanks!
P.S. I haven't asked many questions on SO before, so I'd appreciate feedback on whether I should be asking this in a different way.
data.table::fread or vroom::vroom should let you load specific columns without loading the whole file into memory. – Gregor Thomas
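A minimal sketch of the column-selection arguments the comment is referring to (file name and column names are placeholders):

    library(data.table)
    library(vroom)

    wanted <- c("col_a", "col_b", "col_c")  # the ~100 column names actually needed

    # fread only reads the requested columns when select= is supplied
    dt <- fread("big.csv", select = wanted)

    # vroom indexes the file lazily and materialises only the selected columns
    df <- vroom("big.csv", col_select = all_of(wanted))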