I have grouped data with an ordering within each group, where each row contains a list of values. Within each group, I'd like to count the new values that each row contributes to the running union of the lists in that group.

Here is an example:

require(dplyr)
content <- list(c("A", "B"), c("A", "B", "C"), c("D", "E"), c("A", "B"), c("A", "B"), c("A", "B", "C"))
id <- c("a", "a", "a", "b", "b", "b")
order <- c(5, 7, 3, 1, 9, 4)
testdf <- data.frame(id, order, cbind(content))
testdf
#   id order content
# 1  a     5    A, B
# 2  a     7 A, B, C
# 3  a     3    D, E
# 4  b     1    A, B
# 5  b     9    A, B
# 6  b     4 A, B, C

My desired output (after sorting by order descending within each group) would be like:

#   id order content cc
# 1  a     7 A, B, C 3
# 2  a     5    A, B 3
# 3  a     3    D, E 5
# 4  b     9    A, B 2
# 5  b     4 A, B, C 3
# 6  b     1    A, B 3

cn (cumulative new) would really be preferable to cc (cumulative count), but the above maps to my attempt below, and cn is easily calculated from cc afterwards. Here is my attempted solution, which doesn't work:

res <- testdf %>% 
  arrange(id, desc(order)) %>% 
  mutate(n=row_number()) %>%
  group_by(id) %>%
  mutate(n1=first(n)) %>%
  rowwise() %>%
  bind_cols(do(.,data.frame(vars=length(unique(unlist(testdf$content[.$n1:.$n])))))) %>%
  data.frame

I took most of that solution from the question "Cumulatively paste (concatenate) values grouped by another variable" (thanks akrun). The values generated seem to be correct, but they are not associated with the correct rows of the source data frame:

res
#   id order content n n1 vars
# 1  a     7 A, B, C 1  1    2
# 2  a     5    A, B 2  1    3
# 3  a     3    D, E 3  1    5
# 4  b     9    A, B 4  4    2
# 5  b     4 A, B, C 5  4    2
# 6  b     1    A, B 6  4    3

As you can see from the vars column (which is equivalent to cc above), the first row of group 'a' shows 2 where 3 is expected, and the second row of group 'b' shows 2 where 3 is expected.

I have since worked out what is wrong above: testdf$content is (obviously) not ordered the same way as the data frame inside the dplyr pipeline. Originally I had .$content instead of testdf$content, and that produced even odder output. So I tried doing it in two stages:

res <- testdf %>% 
    arrange(id, desc(order)) %>% 
    mutate(n=row_number()) %>%
    group_by(id) %>%
    mutate(n1=first(n))
res <- res %>% 
    rowwise() %>%
    bind_cols(do(.,data.frame(vars=length(unique(unlist(res$content[.$n1:.$n])))))) %>%
    data.frame

and this produces what I expect:

#   id order content n n1 vars
# 1  a     7 A, B, C 1  1    3
# 2  a     5    A, B 2  1    3
# 3  a     3    D, E 3  1    5
# 4  b     9    A, B 4  4    2
# 5  b     4 A, B, C 5  4    3
# 6  b     1    A, B 6  4    3

So my question now is: is there a better way to refer to the whole dplyr-modified data frame inside the do(), so that content is ordered correctly? I think . is just the current row, isn't it? Being able to do so would save me from creating the ordered data frame separately before the do().

Many thanks

Tim

I'm a bit confused with all the steps, but assuming you have ordered and grouped your data appropriately, you could use cumsum(!duplicated(unlist(x)))[cumsum(lengths(x))] to count cumulatively, where x is the ordered "content" -- e.g. list(c("A", "B", "C"), c("A", "B"), c("D", "E")) for the ordered "content" in group "a" and list(c("A", "B"), c("A", "B", "C"), c("A", "B")) in group "b". - alexis_laz

Thanks for your reply - I had a quick go but I'm not sure where to try that, should it replace the whole rowwise() and bind_cols(do())? I naively tried res %>% cumsum(!duplicated(unlist(content)))[cumsum(lengths(content))] which gave NA's? - Tim

Following your code, I had something like testdf %>% arrange(id, desc(order)) %>% group_by(id) %>% mutate(cumsum(!duplicated(unlist(content)))[cumsum(lengths(content))]) in mind. - alexis_laz

Ok thanks - that worked for me although I'm not quite sure how the selection from the list using the cumsum(lengths(content)) works. I think @Psidom's solution is perhaps easier to understand and so I'll accept that as the solution. Thanks again for contributing. - Tim

By the way - I like your use of methods with "cumulative" naming for this purpose although the double use is a bit confusing. - Tim
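
Since the thread above leaves open how the indexing with cumsum(lengths(content)) works, here is a minimal base R sketch that breaks the one-liner from the comments into steps for group "a". The intermediate names (flat, is_new, running, ends) are illustrative only and do not appear in the original comments.

```r
# Ordered "content" for group "a" (order 7, 5, 3), as given in the comments
x <- list(c("A", "B", "C"), c("A", "B"), c("D", "E"))

flat    <- unlist(x)           # all values laid end to end: A B C A B D E
is_new  <- !duplicated(flat)   # TRUE the first time a value is seen
running <- cumsum(is_new)      # distinct values seen so far, per flattened position
ends    <- cumsum(lengths(x))  # flattened position where each row ends: 3 5 7
cc      <- running[ends]       # running distinct count at the end of each row

print(cc)
```

The second cumsum only converts row lengths into end positions in the flattened vector, so running[ends] picks out the distinct count as of the last value in each row. One caveat, as far as I can tell: a row with an empty vector at the very start would give an end position of 0, which R silently drops when indexing, so the result would come back one element short.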

1 Answer

You can use Reduce with accumulate = TRUE to build the cumulative sets of distinct elements, and then use lengths() to get the cumulative distinct counts. This avoids the rowwise() operation:

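library(dplyr)
testdf %>%
  arrange(desc(order)) %>%   # highest order first, so the union accumulates in that direction
  group_by(id) %>%
  # accumulate = TRUE keeps every intermediate union; lengths() counts each one
  mutate(cc = lengths(Reduce(function(x, y) unique(c(x, y)), content, accumulate = TRUE))) %>%
  arrange(id)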

#Source: local data frame [6 x 4]
#Groups: id [2]

#      id order   content    cc
#  <fctr> <dbl>    <list> <int>
#1      a     7 <chr [3]>     3
#2      a     5 <chr [2]>     3
#3      a     3 <chr [2]>     5
#4      b     9 <chr [2]>     2
#5      b     4 <chr [3]>     3
#6      b     1 <chr [2]>     3
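
To make the Reduce step easier to follow, and to get the cn (cumulative new) column the question mentions, here is a small base R sketch on group "b". The helper name cumulative_distinct is hypothetical, and seeding Reduce with an empty character vector is my own addition rather than part of the answer above. Without a seed, the first row is passed through as-is, so duplicates inside that first row would be counted.

```r
# Hypothetical helper (name is illustrative): cumulative count of distinct
# values across a list of vectors, taken in list order.
cumulative_distinct <- function(x) {
  # Seed with an empty vector so the first element is de-duplicated too,
  # then drop the seed from the accumulated results.
  # Assumption: the content is character; other types would be coerced.
  seen <- Reduce(function(acc, new) unique(c(acc, new)), x,
                 accumulate = TRUE, init = character(0))[-1]
  lengths(seen)
}

# Ordered "content" for group "b" (order 9, 4, 1)
b  <- list(c("A", "B"), c("A", "B", "C"), c("A", "B"))
cc <- cumulative_distinct(b)  # cumulative distinct count per row
cn <- diff(c(0L, cc))         # new values contributed by each row

print(cc)
print(cn)
```

Inside the grouped pipeline above, the same idea should carry over as an extra column, e.g. mutate(cn = diff(c(0L, cc))) after cc has been computed, since diff() on the zero-padded vector returns one value per row of the group.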