11
votes

I am trying to find a way to determine when a set of columns changes value in a data.frame. To get straight to the point, please consider the following example:

x <- data.frame(cnt = 1:10, code = rep('ELEMENT 1', 10), val0 = rep(5, 10), val1 = rep(6, 10), val2 = rep(3, 10))
x[4, 'val0'] <- 6
  • The cnt column is a unique ID (it could be a date or time column; for simplicity it's an int here).
  • The code column is a code identifying a set of rows (imagine several such groups, each with a different code). The code and cnt columns are the keys in my data.table.
  • The val0,val1,val2 columns are something like scores.

The data.frame above should be read as: the scores for 'ELEMENT 1' started as 5,6,3, stayed that way until the 4th iteration, when they changed to 6,6,3, and then changed back to 5,6,3.

My question: is there a way to get the 1st, 4th, and 5th rows of the data.frame, i.e. to detect when the columns change? (There are 12 columns, by the way.)

I tried using data.table's duplicated method (which worked perfectly in the majority of cases), but here it removes all duplicates and leaves only rows 1 and 4, dropping the 5th.
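
To make the problem concrete, here is a minimal sketch of that behaviour. It uses base R's duplicated() as a stand-in (an assumption; the data.table method may differ in details such as key handling), but it shows why a global duplicate check cannot recover row 5:

```r
x <- data.frame(cnt = 1:10, code = rep("ELEMENT 1", 10),
                val0 = rep(5, 10), val1 = rep(6, 10), val2 = rep(3, 10))
x[4, "val0"] <- 6

# duplicated() compares each row against ALL earlier rows, not just the
# previous one, so row 5 (5,6,3) counts as a duplicate of row 1.
kept_rows <- which(!duplicated(x[, c("val0", "val1", "val2")]))
kept_rows
# [1] 1 4
```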

Do you have any suggestions? I would rather not use a for loop as there are approx. 2M lines.


2 Answers

13
votes

In data.table version 1.8.10 (the stable version on CRAN), there's an unexported function called duplist that does exactly this. It's written in C and is therefore terribly fast.

require(data.table) # 1.8.10
data.table:::duplist(x[, 3:5]) 
# [1] 1 4 5

If you're using the development version of data.table (1.8.11), there's a more memory-efficient version, renamed uniqlist, that does exactly the same job. This should probably be exported in the next release, since it seems to have come up on SO more than once. Let's see.

require(data.table) # 1.8.11
data.table:::uniqlist(x[, 3:5])
# [1] 1 4 5
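
For readers who would rather not call unexported functions, a similar result can be sketched with rleid(). This assumes a later data.table release than the two discussed above (1.9.6 or newer, where rleid() is exported); it is an alternative technique, not what duplist or uniqlist do internally:

```r
library(data.table)  # assumes data.table >= 1.9.6, where rleid() is exported

x <- data.frame(cnt = 1:10, code = rep("ELEMENT 1", 10),
                val0 = rep(5, 10), val1 = rep(6, 10), val2 = rep(3, 10))
x[4, "val0"] <- 6

# rleid() returns a run id that increments whenever the combination of the
# supplied columns differs from the previous row.
run_id <- rleid(x$val0, x$val1, x$val2)

# The first row of each run is a row where the values changed.
change_rows <- which(!duplicated(run_id))
change_rows
# [1] 1 4 5
```

With several groups, passing x$code as an additional argument to rleid() should also start a new run whenever the code changes.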
3
votes

Totally unreadable, but:

c(1,which(rowSums(sapply(x[,grep('val',names(x))],diff))!=0)+1)
# [1] 1 4 5

Basically, run diff on each column to find the changes between consecutive rows. If a change occurs in any column, then a change has occurred in that row.

Also, without the sapply:

c(1,which(rowSums(diff(as.matrix(x[,grep('val',names(x))])))!=0)+1)
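
One caveat with summing the differences: changes in different columns can cancel out (for example, val0 going up by 1 while val1 goes down by 1), so the row sum is 0 and the change is missed. A slightly more defensive sketch, under the same assumptions about the column names, tests each difference against 0 before summing:

```r
x <- data.frame(cnt = 1:10, code = rep("ELEMENT 1", 10),
                val0 = rep(5, 10), val1 = rep(6, 10), val2 = rep(3, 10))
x[4, "val0"] <- 6

vals <- as.matrix(x[, grep("^val", names(x))])

# diff() on a matrix gives row-to-row differences for every column.
# Testing != 0 first means opposite-signed changes cannot cancel.
changed <- rowSums(diff(vals) != 0) > 0
change_rows <- c(1L, which(changed) + 1L)
change_rows
# [1] 1 4 5

# Hypothetical two-row example where the changes cancel out:
y <- as.matrix(data.frame(val0 = c(5, 6), val1 = c(6, 5)))
missed <- rowSums(diff(y)) != 0           # FALSE: the sum of differences is 0
caught <- rowSums(diff(y) != 0) > 0       # TRUE: at least one column changed
```

This only covers numeric columns; detecting a change in the code column as well would need a separate comparison, since diff() does not apply to character data.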