This (very basic) question is the result of an exchange here.

The documentation for setkey() states:

setkey() sorts a data.table and marks it as sorted. The sorted columns are the key. The key can be any columns in any order. The columns are sorted in ascending order always. The table is changed by reference... (emphasis added)

I have always interpreted this to mean that setkey() creates an index rather than physically rearranging the rows of the data table (similar to indexing a database table). But if that were true, then removing the key (using setkey(DT,NULL)) should remove the index and restore the data table to its original, unsorted order. This is not what happens:

library(data.table)
DT <- data.table(a=3:1, b=1:3, c=5:7); DT
   a b c
1: 3 1 5
2: 2 2 6
3: 1 3 7
setkey(DT,a); DT
   a b c
1: 1 3 7
2: 2 2 6
3: 3 1 5
setkey(DT,NULL); DT
   a b c
1: 1 3 7
2: 2 2 6
3: 3 1 5

So two questions:

1: If the rows are rearranged (sorted), then what does "changed by reference" mean?

2: What does setkey(DT,NULL) do exactly?

I don't know the answer, but keep in mind that just because the table is displayed as sorted does not mean that it was sorted when you set the key. Typing DT at the console is essentially the same as calling a print function, and that function might be doing the sorting. - joran

1 Answer

  1. The rows are physically sorted. "Changed by reference" here means the table is reordered in place: the entire table is not copied, and the rows are just swapped within the existing object.

  2. setkey(DT, NULL) is equivalent to setattr(DT, "sorted", NULL). It simply unsets the "sorted" attribute.
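
Both points can be checked directly. The following is a minimal sketch, assuming a reasonably recent data.table; the variable names (addr_before, key_set, and so on) are only illustrative. It compares address(DT) before and after setkey() to see whether the table was modified in place, and inspects the "sorted" attribute before and after removing the key.

```r
library(data.table)

DT <- data.table(a = 3:1, b = 1:3, c = 5:7)

# Memory address of the table before keying
addr_before <- address(DT)

# Sort by column a and mark the table as keyed
setkey(DT, a)
addr_after <- address(DT)

# If the table was changed by reference, both addresses should match
print(identical(addr_before, addr_after))

# The key is stored in the "sorted" attribute
key_set     <- key(DT)
sorted_attr <- attr(DT, "sorted")

# Removing the key drops the attribute but does not reorder any rows
setkey(DT, NULL)
key_removed <- key(DT)
rows_after  <- DT$a

print(key_set)
print(key_removed)
print(rows_after)
```

If the addresses match, no new table was allocated by setkey(). After setkey(DT, NULL), the key is gone but column a stays in sorted order, since nothing records the original row order.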