I try to use the group by function of the data.table package in R.
start <- as.Date('2014-1-1')
end <- as.Date('2014-1-6')
time.span <- seq(start, end, "days")
a <- data.table(date = time.span, value=c(1,2,3,4,5,6), group=c('a','a','b','b','a','b'))
date value group
1 2014-01-01 1 a
2 2014-01-02 2 a
3 2014-01-03 3 b
4 2014-01-04 4 b
5 2014-01-05 5 a
6 2014-01-06 6 b
a[,mean(value),by=group]
> group V1
1: a 2.6667
2: b 4.3333
This works fine.
Since i am working with Dates it can happen that a special date not only has one but two groups.
a <- data.table(date = time.span, value=c(1,2,3,4,5,6), group=list('a',c('a','b'),'b','b','a','b'))
date value group
1 2014-01-01 1 a
2 2014-01-02 2 c("a", "b")
3 2014-01-03 3 b
4 2014-01-04 4 b
5 2014-01-05 5 a
6 2014-01-06 6 b
a[,mean(value),by=group]
> Error in `[.data.table`(a, , mean(value), by = group) :
The items in the 'by' or 'keyby' list are length (1,2,1,1,1,1). Each must be same length as rows in x or number of rows returned by i (6).
I would like that the group date with both groups will be used for calculating the mean of group a as well as of group b.
Expected results:
mean a: 2.6667
mean b: 3.75
Is that possible with the data.table package?
Update
Thx to akrun my initial issue is solved. After "splitting" the data.table and in my case calculate different factors (based on the groups) i need the data.table back in its "original" form with unique rows based on the date. My solution so far:
a <- data.table(date = time.span, value=c(1,2,3,4,5,6), group=list('a',c('a','b'),'b','b','a','b'))
b <- a[rep(1:nrow(a), lengths(group))][, group:=unlist(a$group)]
date value group
1 2014-01-01 1 a
2 2014-01-02 2 a
3 2014-01-02 2 b
4 2014-01-03 3 b
5 2014-01-04 4 b
6 2014-01-05 5 a
7 2014-01-06 6 b
# creates new column with mean based on group
b[,factor := mean(value), by=group]
#creates new data.table c without duplicate rows (based on date) + if a row has group a & b it creates the product of their factors
c <- b[,.(value = unique(value), group = list(group), factor = prod(factor)),by=date]
date value group factor
01/01/14 1 a 2.666666667
02/01/14 2 c("a", "b") 10
03/01/14 3 b 3.75
04/01/14 4 b 3.75
05/01/14 5 a 2.666666667
06/01/14 6 b 3.75
I guess it is not the perfect way to do it, but it works. Any suggestions how i could do it better?
Alternative solution (really slow!!!):
d <- a[rep(1:nrow(a), lengths(group))][,group:=unlist(a$group)][, mean(value), by = group]
for(i in 1:NROW(a)){
y1 <- 1
for(j in a[i,group][[1]]){
y1 <- y1 * d[group==j, V1]
}
a[i, factor := y1]
}
My fastest solution so far:
# split rows that more than one group
b <- a[rep(1:nrow(a), lengths(group))][, group:=unlist(a$group)]
# calculate mean of different groups
b <- b[,factor := mean(value), by=group]
# only keep date + factor columns
b <- b[,.(date, factor)]
# summarise rows by date
b <- b[,lapply(.SD,prod), by=date]
# add summarised factor column to initial data.table
c <- merge(a,b,by='date')
Any chance to make it faster?