10
votes

I would like to be able to skip a column that is read into R via data.table's fread function in v1.8.9. But the csv I am reading in, has no column headers…which appears to be a problem for fread... is there a way to just specify that I don't want specific columns?

Would it be better to just pre-allocate a column name and then let it read it in so that it can be skipped?

To give an example, I downloaded the data from the following URL

http://www.truefx.com/dev/data/2013/MAY-2013/AUDUSD-2013-05.zip

unzipped it…

and read the csv into R using fread and it has pretty much the same file name just with the csv extension.

system.time(pp <- fread("AUDUSD-2013-05.csv",sep=","))
  user  system elapsed 
16.427   0.257  16.682 

head(pp)
       V1                    V2      V3      V4
1: AUD/USD 20130501 00:00:04.728 1.03693 1.03721
2: AUD/USD 20130501 00:00:21.540 1.03695 1.03721
3: AUD/USD 20130501 00:00:33.789 1.03694 1.03721
4: AUD/USD 20130501 00:00:37.499 1.03692 1.03724
5: AUD/USD 20130501 00:00:37.524 1.03697 1.03719
6: AUD/USD 20130501 00:00:39.789 1.03697 1.03717

str(pp)
Classes ‘data.table’ and 'data.frame':  4060762 obs. of  4 variables:
$ V1: chr  "AUD/USD" "AUD/USD" "AUD/USD" "AUD/USD" ...
$ V2: chr  "20130501 00:00:04.728" "20130501 00:00:21.540" "20130501 00:00:33.789" "20130501 00:00:37.499" ...
$ V3: num  1.04 1.04 1.04 1.04 1.04 ...
$ V4: num  1.04 1.04 1.04 1.04 1.04 ...
- attr(*, ".internal.selfref")=<externalptr> 

I tried using the new(ish) colClasses or skip arguments to ignore the fact that the first column is all the same…and is unnecessary.

but doing:

pp1 <- fread("AUDUSD-2013-05.csv",sep=",",skip=1)

doesn't omit the reading in of the first column

and using colClasses leads to the following error

pp1 <- fread("AUDUSD-2013-05.csv",sep=",",colClasses=list(NULL,"character","numeric","numeric"))

Error in fread("AUDUSD-2013-05.csv", sep = ",", colClasses = list(NULL,  : 
 colClasses is type list but has no names

other attempts incude

pp1 <- fread("AUDUSD-2013-06.csv",sep=",", colClasses=c(V1=NULL,V2="character",V3="numeric",V4="numeric"))
str(pp1)
Classes ‘data.table’ and 'data.frame':  5524877 obs. of  4 variables:
 $ V1: chr  "AUD/USD" "AUD/USD" "AUD/USD" "AUD/USD" ...
 $ V2: chr  "20130603 00:00:00.290" "20130603 00:00:00.291" "20130603 00:00:00.292" "20130603 00:00:03.014" ...
 $ V3: num  0.962 0.962 0.962 0.962 0.962 ...
 $ V4: num  0.962 0.962 0.962 0.962 0.962 ...
 - attr(*, ".internal.selfref")=<externalptr>

i.e pretty much exactly the same as if I had not used colClasses...

Are there any suggestions to be able to speed up the reading in of data by omitting the first column?

Also perhaps a bit much to ask, but is it possible to directly read a zip file rather than unzipping it first and then reading in the csv?

Oh and if it wasn't clear I'm using data.table v1.8.9

1
R usually creates its own column names like you have above V1,V2 etc. So if you can read in the data like above than I'm not sure what the problem is? You could try to do colClasses = NULL without declaring that its a list. also you could fill in colClasses = c(col1,col2,col3,col4,col5)Sander Van der Zeeuw
skip argument is about skipping rows not columns; I'm not convinced that had there been a col.skip argument, you'd get a noticeable speed improvement - you'd still have to read in those chars before deciding to abandon themeddi
@SanderVanderZeeuw I tried your method...but got the following error: pp <- fread("AUDUSD-2013-05.csv",sep=",", colClasses=c(NULL,"character","numeric","numeric")) Error in fread("AUDUSD-2013-05.csv", sep = ",", colClasses = c(NULL, "character", : colClasses is unnamed and length 3 but there are 4 columns. See ?data.table for more info about the colClasses argument.h.l.m
@h.l.m U must try colClasses=NULL and nothing else. R usually knows how to deal with certain columns. And 1 more question you dont need the entire CSV file thus only the last 3 columns?Sander Van der Zeeuw
Yes I only want the last three columns....h.l.m

1 Answers

16
votes

I think the argument you're looking for is drop. Try:

require(data.table)  # 1.9.2+
pp <- fread("AUDUSD-2013-05.csv", drop = 1)

Note that you can drop by name or position.

fread("AUDUSD-2013-05.csv", drop = c("columThree","anotherColumnName"))

fread("AUDUSD-2013-05.csv", drop = 10:15)  # read all columns other than 10:15

And you can select by name or position, too.

fread("AUDUSD-2013-05.csv", select = 10:15)  # read only columns 10:15

fread("AUDUSD-2013-05.csv", select = c("columnA","columnName2"))

These arguments were added to v1.9.2 (released to CRAN in Feb 2014) and are documented in ?fread. You'll need to upgrade to use them.