I am using data.table, and many of its functions require me to set a key (e.g. X[Y]). As such, I wish to understand what a key does in order to properly set keys in my data tables.
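For context, the kind of call I mean looks roughly like the sketch below. The tables X and Y and their column names are made up purely for illustration; only setkey() and the X[Y] syntax come from data.table itself.

```r
library(data.table)

# Hypothetical toy tables, invented for this example.
X <- data.table(id = c("a", "b", "c"), x = 1:3)
Y <- data.table(id = c("b", "c"), y = c(10, 20))

# Set the key on both tables so they can be matched on id.
setkey(X, id)
setkey(Y, id)

# Keyed join: for each row of Y, look up the rows of X with the same key value.
res <- X[Y]
print(res)
```

As far as I can tell, the key is what tells X[Y] which columns to match on, but that is exactly the part I would like to understand better.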
One source I read was ?setkey.
setkey() sorts a data.table and marks it as sorted. The sorted columns are the key. The key can be any columns in any order. The columns are sorted in ascending order always. The table is changed by reference. No copy is made at all, other than temporary working memory as large as one column.
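To check my reading of that passage, here is a minimal sketch of what I believe it describes. The table DT and its columns are hypothetical; key(), setkey() and address() are data.table functions.

```r
library(data.table)

# Hypothetical toy table, deliberately out of order.
DT <- data.table(id = c("c", "a", "b"), value = c(3, 1, 2))

print(key(DT))               # no key has been set yet
addr_before <- address(DT)   # memory address of the table before keying

# Sort DT by id in place and mark id as the key.
setkey(DT, id)

print(key(DT))               # the key is now the id column
print(DT)                    # rows are in ascending order of id

# Compare addresses to see whether the table object itself was copied.
print(identical(addr_before, address(DT)))
```

Note that setkey() is called for its side effect: there is no assignment such as DT <- setkey(DT, id).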
My takeaway here is that a key would "sort" the data.table, resulting in a very similar effect to order(). However, it doesn't explain the purpose of having a key.
The data.table FAQ 3.2 and 3.3 explain:
3.2 I don't have a key on a large table, but grouping is still really quick. Why is that?
data.table uses radix sorting. This is significantly faster than other sort algorithms. Radix is specifically for integers only, see ?base::sort.list(x,method="radix"). This is also one reason why setkey() is quick. When no key is set, or we group in a different order from that of the key, we call it an ad hoc by.
3.3 Why is grouping by columns in the key faster than an ad hoc by?
Because each group is contiguous in RAM, thereby minimising page fetches, and memory can be copied in bulk (memcpy in C) rather than looping in C.
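As I understand the two FAQ entries, the distinction is between the two calls sketched below. The data are randomly generated and the column names grp and x are my own; this only illustrates the two situations and says nothing about their relative speed.

```r
library(data.table)

set.seed(1)
# Hypothetical data: 100,000 rows spread over 26 groups.
DT <- data.table(grp = sample(letters, 1e5, replace = TRUE), x = runif(1e5))

# Ad hoc by: no key is set, so the rows of each group are scattered in memory.
adhoc <- DT[, .(total = sum(x)), by = grp]

# Keyed by: after setkey() the rows of each group sit next to each other.
setkey(DT, grp)
keyed <- DT[, .(total = sum(x)), by = grp]

# Both give the same totals per group; only the order of the groups differs.
print(all.equal(adhoc[order(grp)]$total, keyed$total))
```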
From here, I guess that setting a key somehow allows R to use "radix sorting" over other algorithms, and that's why it is faster.
The 10 minute quick start guide also has a guide on keys.
- Keys
Let's start by considering data.frame, specifically rownames (or in English, row names). That is, the multiple names belonging to a single row. The multiple names belonging to the single row? That is not what we are used to in a data.frame. We know that each row has at most one name. A person has at least two names, a first name and a second name. That is useful to organise a telephone directory, for example, which is sorted by surname, then first name. However, each row in a data.frame can only have one name.
A key consists of one or more columns of rownames, which may be integer, factor, character or some other class, not simply character. Furthermore, the rows are sorted by the key. Therefore, a data.table can have at most one key, because it cannot be sorted in more than one way.
Uniqueness is not enforced, i.e., duplicate key values are allowed. Since the rows are sorted by the key, any duplicates in the key will appear consecutively.
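To make the telephone directory analogy concrete for myself, I tried a small sketch along these lines. The table phone and its contents are invented; the point is only to show a two-column key with a duplicated key value.

```r
library(data.table)

# Hypothetical directory; note that "Jones, Alice" appears twice.
phone <- data.table(
  surname = c("Smith", "Jones", "Smith", "Jones"),
  first   = c("Bob", "Alice", "Ann", "Alice"),
  number  = c(104L, 101L, 103L, 102L)
)

# A key made of two columns: sort by surname, then by first name.
setkey(phone, surname, first)

print(key(phone))   # both columns together form the single key
print(phone)        # the two "Jones, Alice" rows end up next to each other

# Flag rows whose key value repeats an earlier row; duplicates are allowed.
print(duplicated(phone, by = key(phone)))
```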
The telephone directory was helpful in understanding what a key is, but it seems that a key is no different from having a factor column. Furthermore, it does not explain why a key is needed (especially to use certain functions) or how to choose the column to set as the key. Also, it seems that in a data.table with time as a column, setting any other column as the key would probably mess up the time column too, which makes it even more confusing, as I do not know whether I am allowed to set any other column as the key. Can someone enlighten me please?