24
votes

I want to convert a subset of data.table cols to a new class. There's a popular question here (Convert column classes in data.table) but the answer creates a new object, rather than operating on the starter object.

Take this example:

dat <- data.frame(ID=c(rep("A", 5), rep("B",5)), Quarter=c(1:5, 1:5), value=rnorm(10))
cols <- c('ID', 'Quarter')

How best to convert to just the cols columns to (e.g.) a factor? In a normal data.frame you could do this:

dat[, cols] <- lapply(dat[, cols], factor)

but that doesn't work for a data.table, and neither does this

dat[, .SD := lapply(.SD, factor), .SDcols = cols]

A comment in the linked question from Matt Dowle (from Dec 2013) suggests the following, which works fine, but seems a bit less elegant.

for (j in cols) set(dat, j = j, value = factor(dat[[j]]))

Is there currently a better data.table answer (i.e. shorter + doesn't generate a counter variable), or should I just use the above + rm(j)?

2
I believe method Matt Dowle recommends would be best. He is, after all, the data.table author.Rich Scriven
True, but the comment was from 2013 and there have been many package updates since then, so I thought it worth throwing this fishing line outarvi1000
More detail on the for(...) set(...) idiom added recently here: stackoverflow.com/a/33000778/403310Matt Dowle

2 Answers

49
votes

Besides using the option as suggested by Matt Dowle, another way of changing the column classes is as follows:

dat[, (cols) := lapply(.SD, factor), .SDcols = cols]

By using the := operator you update the datatable by reference. A check whether this worked:

> sapply(dat,class)
       ID   Quarter     value 
 "factor"  "factor" "numeric" 

As suggeted by @MattDowle in the comments, you can also use a combination of for(...) set(...) as follows:

for (col in cols) set(dat, j = col, value = factor(dat[[col]]))

which will give the same result. A third alternative is:

for (col in cols) dat[, (col) := factor(dat[[col]])]

On a smaller datasets, the for(...) set(...) option is about three times faster than the lapply option (but that doesn't really matter, because it is a small dataset). On larger datasets (e.g. 2 million rows), each of these approaches takes about the same amount of time. For testing on a larger dataset, I used:

dat <- data.table(ID=c(rep("A", 1e6), rep("B",1e6)),
                  Quarter=c(1:1e6, 1:1e6),
                  value=rnorm(10))

Sometimes, you will have to do it a bit differently (for example when numeric values are stored as a factor). Then you have to use something like this:

dat[, (cols) := lapply(.SD, function(x) as.integer(as.character(x))), .SDcols = cols]


WARNING: The following explanation is not the data.table-way of doing things. The datatable is not updated by reference because a copy is made and stored in memory (as pointed out by @Frank), which increases memory usage. It is more an addition in order to explain the working of with = FALSE.

When you want to change the column classes the same way as you would do with a dataframe, you have to add with = FALSE as follows:

dat[, cols] <- lapply(dat[, cols, with = FALSE], factor)

A check whether this worked:

> sapply(dat,class)
       ID   Quarter     value 
 "factor"  "factor" "numeric" 

If you don't add with = FALSE, datatable will evaluate dat[, cols] as a vector. Check the difference in output between dat[, cols] and dat[, cols, with = FALSE]:

> dat[, cols]
[1] "ID"      "Quarter"

> dat[, cols, with = FALSE]
    ID Quarter
 1:  A       1
 2:  A       2
 3:  A       3
 4:  A       4
 5:  A       5
 6:  B       1
 7:  B       2
 8:  B       3
 9:  B       4
10:  B       5
3
votes

You can use .SDcols:

dat[, cols] <- dat[, lapply(.SD, factor), .SDcols=cols]