I have grouped data with ordering within the groups where each row contains a list of values and within each group I'd like to produce a count of new list values contributed by each row to the union of the lists in each group.
Here is an example:
require(dplyr)
content <- list(c("A", "B"), c("A", "B", "C"), c("D", "E"), c("A", "B"), c("A", "B"), c("A", "B", "C"))
id <- c("a", "a", "a", "b", "b", "b")
order <- c(5, 7, 3, 1, 9, 4)
testdf <- data.frame(id, order, cbind(content))
testdf
# id order content
# 1 a 5 A, B
# 2 a 7 A, B, C
# 3 a 3 D, E
# 4 b 1 A, B
# 5 b 9 A, B
# 6 b 4 A, B, C
My desired output (after sorting by order descending within each group) would be like:
# id order content cc
# 1 a 7 A, B, C 3
# 2 a 5 A, B 3
# 3 a 3 D, E 5
# 4 b 9 A, B 2
# 5 b 4 A, B, C 3
# 6 b 1 A, B 3
cn (cumulative new) would be preferable to cc (cumulative count) really, but the above maps to my attempt below and cn is easily calculated subsequently. Here is my attempted solution that doesn't work:
res <- testdf %>%
arrange(id, desc(order)) %>%
mutate(n=row_number()) %>%
group_by(id) %>%
mutate(n1=first(n)) %>%
rowwise() %>%
bind_cols(do(.,data.frame(vars=length(unique(unlist(testdf$content[.$n1:.$n])))))) %>%
data.frame
I actually obtained most of that solution from here: Cumulatively paste (concatenate) values grouped by another variable (thanks akrun). The values generated seem to be correct but they are not associated with the correct rows from the source data frame:
res
# id order content n n1 vars
# 1 a 7 A, B, C 1 1 2
# 2 a 5 A, B 2 1 3
# 3 a 3 D, E 3 1 5
# 4 b 9 A, B 4 4 2
# 5 b 4 A, B, C 5 4 2
# 6 b 1 A, B 6 4 3
As you can see (looking at the vars column which is equivalent to cc above) for group 'a' values 2 and 3 are reversed and for group 'b' the second 2 and 3 values are reversed.
Actually I worked out what is wrong above, the testdf$content is (obviously) not ordered the same as the dplyr'd data frame. Originally I'd had .$content
instead of testdf$content
and that had produced even odder output. So I tried doing it in two stages:
res <- testdf %>%
arrange(id, desc(order)) %>%
mutate(n=row_number()) %>%
group_by(id) %>%
mutate(n1=first(n))
res <- res %>%
rowwise() %>%
bind_cols(do(.,data.frame(vars=length(unique(unlist(res$content[.$n1:.$n])))))) %>%
data.frame
and this produces what I expect:
# id order content n n1 vars
# 1 a 7 A, B, C 1 1 3
# 2 a 5 A, B 2 1 3
# 3 a 3 D, E 3 1 5
# 4 b 9 A, B 4 4 2
# 5 b 4 A, B, C 5 4 3
# 6 b 1 A, B 6 4 3
So my question now is is there a better way to refer to the whole dplyr-modified data frame inside the do()
(so that content
is ordered correctly) - I think .
is just the current row isn't it? Being able to do so would avoid me having to create the ordered data frame separately before the do()
.
Many thanks
Tim
cumsum(!duplicated(unlist(x)))[cumsum(lengths(x))]
to count cumulatively, wherex
is the ordered "content" -- e.g.list(c("A", "B", "C"), c("A", "B"), c("D", "E"))
for the ordered "content" in group "a" andlist(c("A", "B"), c("A", "B", "C"), c("A", "B"))
in group "b". – alexis_lazrowwise()
andbind_cols(do())
? I naively triedres %>% cumsum(!duplicated(unlist(content)))[cumsum(lengths(content))]
which gave NA's? – Timtestdf %>% arrange(id, desc(order)) %>% group_by(id) %>% mutate(cumsum(!duplicated(unlist(content)))[cumsum(lengths(content))])
in mind – alexis_laz