
The function max() works correctly on a column of type ordered factor. However, the same operation fails when the query is grouped with by=.

Let's say I have a data.table as:

library(data.table)
DT <- data.table(ID=rep(1:3, 3), State=sample(LETTERS[1:3], 9, replace=TRUE))

Convert the column State to an ordered factor:

DT[, State := factor(State, levels=LETTERS[1:3], ordered = TRUE)]

This works:

DT[, max(State)]

This fails with an error:

DT[, max(State), by="ID"]

The error is: Error in gmax(State) : max is not meaningful for factors.

How come?

This solves the problem, and I don't understand the mechanics of it. Please help. DT[, min(ordered(State)), by="ID"] - Sun Bee
DT[, max(ordered(State)), by="ID"] is giving me an error, but DT[, State[which.max(as.numeric(State))], by = ID] works. Not sure why DT[, max(State), by="ID"] gives an error though, especially since DT[, class(State), by = ID] shows it's still an ordered factor after grouping. - IceCreamToucan
@Ryan There are 6 rows in the query DT[, class(State), by = ID]. Not sure why that happens, or if it's relevant. - Ameya
See ?GForce. They just haven't coded it for ordered factors yet, I guess. Issue opened over here: github.com/Rdatatable/data.table/issues/1947 - Frank

1 Answer


This was a bug that has been fixed in the current development version of data.table.

You can install the development version via:

install.packages('data.table', type = 'source',
                 repos = 'http://Rdatatable.github.io/data.table')

If this fails, check full details on the Installation wiki.

library(data.table)
# data.table 1.11.5 IN DEVELOPMENT built 2018-08-13 20:20:11 UTC; travis  Latest news: r-datatable.com
DT[ , max(State), by="ID"]
#    ID V1
# 1:  1  C
# 2:  2  C
# 3:  3  B
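
To see whether an installed copy is at least as new as the development build shown in the startup message above, a version comparison along these lines may help. This is only a sketch: treating 1.11.5 as the cutoff is an assumption taken from that startup message, not from the data.table changelog.

```r
# Installed data.table version, read from the package metadata
installed <- utils::packageVersion("data.table")

# Assumption: 1.11.5 (the development version shown above) is the first
# version that contains the fix
has_fix <- installed >= "1.11.5"
has_fix
```

If has_fix is FALSE, either update the package or use the workaround described below.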

For those in controlled/production environments unable to update, you can still sidestep the problem by running:

# options() returns the previous settings as a list
dt_optim = options(datatable.optimize = 0)
DT[ , max(State), by="ID"]
# restoring the previous setting afterwards to keep your code running as fast as possible
options(dt_optim)

The bug came from GForce, data.table's internally optimized grouping framework; the above workaround switches that optimization off, so the GForce version of max is never called and j is evaluated with base::max for each group instead.
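
If the workaround is needed in several places, it can be wrapped so that the option is always restored, even when the query throws an error. The following is a minimal sketch, not part of data.table: the helper name without_dt_optimization is hypothetical, and the example data is a deterministic stand-in for the DT defined in the question.

```r
library(data.table)

# Hypothetical helper (not a data.table function): run f() with data.table's
# query optimization switched off, then put the previous setting back.
without_dt_optimization <- function(f) {
  old <- options(datatable.optimize = 0)  # options() returns the old values as a list
  on.exit(options(old), add = TRUE)       # restored even if f() fails
  f()
}

# Deterministic stand-in for the question's DT (which uses sample())
DT <- data.table(
  ID    = rep(1:3, 3),
  State = factor(rep(LETTERS[1:3], 3), levels = LETTERS[1:3], ordered = TRUE)
)

# One row per ID, computed without GForce
res <- without_dt_optimization(function() DT[, max(State), by = "ID"])
res
```

Passing verbose = TRUE to the query should report whether GForce rewrote j, which is one way to confirm that the optimization is really switched off inside the helper.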