4
votes

update (June 2016)

col.names was added in data.table 1.9.6, so the issue is resolved and everyone is super happy :) I think I can now convert all my read.csv calls to fread calls without fear of breaking anything.

original question

using data.table 1.9.4

I'm converting read.csv calls to fread because of the HUGE performance improvements we've noticed. Most issues I can handle, but I've reached a point where I'm clueless and wonder if anyone has an elegant solution.

My problem is that I have named colClasses but the input has no header (it's the output of a grep command). Here's a silly example to illustrate:

males.students <- read.csv(pipe("grep Male students.csv"), 
                           header=FALSE,
                           col.names=c("id", "name", "gender"), 
                           colClasses=c(id="numeric"))

Now, in fread I still want the named colClasses, but I have no column names, so just using

males.students <- fread("grep Male students.csv", 
                        colClasses=c(id="numeric"))

fails with

Column name 'id' in colClasses[[1]] not found

How can I fix that? Are there plans to add col.names?

2
Since you know that it's the first column, you can just use the column number for this: colClasses=list(numeric=1). - shadow
Try using the select argument as well. - Rich Scriven
@shadow this was obviously an extremely simplified example. The real csv has 40 columns and I'm setting colClasses for 19 of them, and I really don't think it's reasonable to use column numbers. You want your system to be robust and allow changes without having your data mysteriously change types without ever figuring out why. - ben.pere
Just curious, why do you need to set 19 col classes? What classes are they? If the data is clean they should read in correctly. - Rich Scriven
@RichardScriven select only lets you choose the columns you want returned by name/index, not change their names. I need to set colClasses because some columns might look like numbers but are actually only meaningful as characters, e.g. in financials, exchange order ids "00000000123" and "00000123" are different orders, and that difference is only preserved as characters, not as numbers. - ben.pere

2 Answers

3
votes

Add the names in the command line:

fread('echo "id,name,gender"; grep Male students.csv', colClasses = c(id='numeric'))
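
To see what fread actually receives from a command like this, here is a minimal shell sketch. The contents of students.csv below are made-up sample data (the question never shows the real file), and the temporary directory and the variable name with_header are only for illustration.

```shell
# Hypothetical sample data; the real students.csv is not shown in the question.
workdir=$(mktemp -d)
cat > "$workdir/students.csv" <<'EOF'
id,name,gender
1,Alice,Female
2,Bob,Male
3,Carol,Female
4,Dave,Male
EOF

# grep drops the file's own header (it does not contain "Male"), so echo
# supplies one first. grep is case-sensitive, so "Female" does not match.
with_header=$(echo "id,name,gender"; grep Male "$workdir/students.csv")
printf '%s\n' "$with_header"
```

The first line of the output is the hard-coded header, so named colClasses have something to match against. The trade-off is that the column names now live in the command string and have to be kept in sync with the file by hand.
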
0
votes

Answering the original question, if the problem is that grep removes the header, you could use awk instead, to print the first line and any lines containing "Male":

fread("awk 'NR==1 || /Male/' students.csv", colClasses=c(id="numeric"))

This might help people who still use an older version of data.table.
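
A quick way to check the awk filter on its own, outside of R, is sketched below. The sample file is invented for illustration and assumes that students.csv has a header on its first line and that gender is the third comma-separated field. The stricter variant with -F, is not part of the answer above, just one possible refinement.

```shell
# Hypothetical sample data with a header row on line 1.
workdir=$(mktemp -d)
cat > "$workdir/students.csv" <<'EOF'
id,name,gender
1,Alice,Female
2,Bob,Male
3,Carol,Female
4,Dave,Male
EOF

# Keep line 1 (the header) plus every line that contains "Male" anywhere.
filtered=$(awk 'NR==1 || /Male/' "$workdir/students.csv")
printf '%s\n' "$filtered"

# Stricter variant (assumption: gender is field 3): match that field exactly,
# so a name such as "Maleficent" would not be picked up by accident.
strict=$(awk -F, 'NR==1 || $3 == "Male"' "$workdir/students.csv")
printf '%s\n' "$strict"
```

With this sample both commands print the header followed by the two Male rows. The regex form matches "Male" in any column, which is worth keeping in mind for real data.
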