0
votes

I am trying to create a function that will make R able to read each singular unit (ID) and count the number of co-occurrences of specific characters in that unit. The dataset is here below:

       ID    class  weight
1       1       A       1.0
2       1       A       1.0
3       1       B       1.0
4       2       A       1.0
5       2       B       1.0
6       2       C       1.0
7       3       B       1.0
8       4       B       1.0
9       4       C       1.0
10      4       C       1.0
11      4       D       1.0
12      4       D       1.0
13      5       A       0.9
14      5       B       0.9
15      5       C       0.9
16      5       D       0.9
17      6       B       0.8
18      6       B       0.8
19      7       C       0.7
20      7       C       0.7
21      7       D       0.7
22      7       D       0.7
23      8       C       0.6
24      8       D       0.6
25      9       D       0.5
26      9       E       0.5
27      9       E       0.5
28     10       C       0.4
29     10       C       0.4
30     10       C       0.4
31     10       E       0.4
32     11       A       0.3
33     11       A       0.3
34     11       A       0.3
35     12       A       0.2
36     12       B       0.2
37     12       C       0.2
38     13       B       0.1
39     13       D       0.1
40     13       D       0.1
41     13       E       0.1
42     14       D       1.0
43     14       E       1.0
44     15       B       1.0
45     15       B       1.0
46     15       C       1.0
47     15       C       1.0
48     15       D       1.0
49     16       C       1.0
50     16       D       1.0
51     16       E       1.0
52     16       E       1.0
53     17       B       1.0
54     17       C       1.0
55     17       C       1.0
56     18       D       1.0
57     18       D       1.0
58     18       E       1.0
59     19       E       1.0
60     19       E       1.0
61     20       B       1.0
62     20       D       1.0
63     20       E       1.0
64     20       E       1.0

I tried to create a loop function, but I don't know how to correctly specificy the expression. R should recognize ID from 1 up to 20, and in each ID count how many times the characters co-occur together. Not only that, each co-occurrence has to be weighted by the specific weight of the ID. Any thoughts about generating a loop function?

Some specifics: In ID 1 class A and B co-occur two times (first A with B and second A with B), which multiplied by the weight (1) gives a preliminary value of 2. The co-occurrence value of A and B should be 4.1 after the entire list is completed by the loop, and that value should be reported in a matrix 5x5 that looks like this:

   A   B    C    D    E
A  1   4.1  ..
B 4.1  1    ..
C ..   ..   1
D ..             1
E ..                  1

Co-occurrence between identical classes will be just 1.

dput(data) structure(list(ID = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L, 6L, 6L, 7L, 7L, 7L, 7L, 8L, 8L, 9L, 9L, 9L, 10L, 10L, 10L, 10L, 11L, 11L, 11L, 12L, 12L, 12L, 13L, 13L, 13L, 13L, 14L, 14L, 15L, 15L, 15L, 15L, 15L, 16L, 16L, 16L, 16L, 17L, 17L, 17L, 18L, 18L, 18L, 19L, 19L, 20L, 20L, 20L, 20L), class = c("A", "A", "B", "A", "B", "C", "B", "B", "C", "C", "D", "D", "A", "B", "C", "D", "B", "B", "C", "C", "D", "D", "C", "D", "D", "E", "E", "C", "C", "C", "E", "A", "A", "A", "A", "B", "C", "B", "D", "D", "E", "D", "E", "B", "B", "C", "C", "D", "C", "D", "E", "E", "B", "C", "C", "D", "D", "E", "E", "E", "B", "D", "E", "E"), weight = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0.9, 0.9, 0.9, 0.9, 0.8, 0.8, 0.7, 0.7, 0.7, 0.7, 0.6, 0.6, 0.5, 0.5, 0.5, 0.4, 0.4, 0.4, 0.4, 0.3, 0.3, 0.3, 0.2, 0.2, 0.2, 0.1, 0.1, 0.1, 0.1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)), row.names = c(NA, -64L), class = c("data.table", "data.frame"), .internal.selfref = ) gc() used (Mb) gc trigger (Mb) max used (Mb) Ncells 2672851 142.8 4316924 230.6 4316924 230.6 Vcells 5761794 44.0 12425324 94.8 29629603 226.1 library(data.table) data <- fread("toy.csv") dput(data) structure(list(ID = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L, 6L, 6L, 7L, 7L, 7L, 7L, 8L, 8L, 9L, 9L, 9L, 10L, 10L, 10L, 10L, 11L, 11L, 11L, 12L, 12L, 12L, 13L, 13L, 13L, 13L, 14L, 14L, 15L, 15L, 15L, 15L, 15L, 16L, 16L, 16L, 16L, 17L, 17L, 17L, 18L, 18L, 18L, 19L, 19L, 20L, 20L, 20L, 20L), class = c("A", "A", "B", "A", "B", "C", "B", "B", "C", "C", "D", "D", "A", "B", "C", "D", "B", "B", "C", "C", "D", "D", "C", "D", "D", "E", "E", "C", "C", "C", "E", "A", "A", "A", "A", "B", "C", "B", "D", "D", "E", "D", "E", "B", "B", "C", "C", "D", "C", "D", "E", "E", "B", "C", "C", "D", "D", "E", "E", "E", "B", "D", "E", "E"), weight = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0.9, 0.9, 0.9, 0.9, 0.8, 0.8, 0.7, 0.7, 0.7, 0.7, 0.6, 0.6, 0.5, 0.5, 0.5, 0.4, 0.4, 0.4, 0.4, 0.3, 0.3, 0.3, 0.2, 0.2, 0.2, 0.1, 0.1, 0.1, 0.1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)), row.names = c(NA, -64L), class = c("data.table", "data.frame"), .internal.selfref = )

3
dput(mat) does not match your first data frame. Was this intentional? Can you dput the one at the top of your post?Marian Minar
Yeah, sorry about that. My brain was fried when I posted it. It is now the correct version.Boletus
The syntax of your new dput has some issues, but selecting the trailing structure seems to replicate your dataMarian Minar
How is it counted if say, for id = n, we have c("A", "A", "B", "B"). Is this counted as 4 or 2, or ...?Marian Minar
4 would be what I am looking for.Boletus

3 Answers

1
votes

Here is one way:

library(tidyverse)

Data

data <- structure(list(ID = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L, 6L, 6L, 7L, 7L, 7L, 7L, 8L, 8L, 9L, 9L, 9L, 10L, 10L, 10L, 10L, 11L, 11L, 11L, 12L, 12L, 12L, 13L, 13L, 13L, 13L, 14L, 14L, 15L, 15L, 15L, 15L, 15L, 16L, 16L, 16L, 16L, 17L, 17L, 17L, 18L, 18L, 18L, 19L, 19L, 20L, 20L, 20L, 20L), class = c("A", "A", "B", "A", "B", "C", "B", "B", "C", "C", "D", "D", "A", "B", "C", "D", "B", "B", "C", "C", "D", "D", "C", "D", "D", "E", "E", "C", "C", "C", "E", "A", "A", "A", "A", "B", "C", "B", "D", "D", "E", "D", "E", "B", "B", "C", "C", "D", "C", "D", "E", "E", "B", "C", "C", "D", "D", "E", "E", "E", "B", "D", "E", "E"), weight = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0.9, 0.9, 0.9, 0.9, 0.8, 0.8, 0.7, 0.7, 0.7, 0.7, 0.6, 0.6, 0.5, 0.5, 0.5, 0.4, 0.4, 0.4, 0.4, 0.3, 0.3, 0.3, 0.2, 0.2, 0.2, 0.1, 0.1, 0.1, 0.1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)), row.names = c(NA, -64L), class = c("data.table", "data.frame")) %>% as_tibble()

Main

Create a "count" data frame:

(df <- data %>%
  count(ID, class, weight) %>%
  spread(class, n, fill = 0))

Get all combinations of classes:

eg <- expand.grid(unique(data$class), unique(data$class), stringsAsFactors = FALSE)

Make a tibble with the pairs and weighted counts:

final <- map2(
  eg$Var1,
  eg$Var2,
  ~ df %>% select(.x, .y, weight) %>% 
    mutate(counts = !!sym(.x) * !!sym(.y)) %>%
    mutate(wt_counts = counts * weight) %>%
    select(wt_counts) %>%
    sum() %>%
    tibble(Var1 = .x, Var2 = .y, wt_count = .)
) 

Convert to a matrix:

finalmatrix <- bind_rows(final) %>%
  mutate(wt_count = ifelse(Var1 == Var2, 1, wt_count)) %>%
  spread(Var2, wt_count) %>%
  select(-Var1) %>%
  as.matrix()

Finally, set names:

row.names(finalmatrix) <- colnames(finalmatrix)

Result

> finalmatrix
    A    B    C    D   E
A 1.0  4.1  2.1  0.9 0.0
B 4.1  1.0 10.1  6.1 2.1
C 2.1 10.1  1.0 11.3 3.2
D 0.9  6.1 11.3  1.0 8.2
E 0.0  2.1  3.2  8.2 1.0

Note

I personally don't like how long my solution was and I couldn't see a way around using rlang stuff (!!sym()), but it works nonetheless.

1
votes

EDIT:

Modified to match @Marian Minar's answer. Also added a tidyverse solution which is the fastest of the three methods for this small dataset.

Tidyverse:

  mat_ans_2 <- DF%>%
    count(ID, class, weight)%>%
    inner_join(., ., by = 'ID')%>%
    filter(class.x != class.y)%>%
    group_by(class.x, class.y)%>%
    summarize(co_occur = sum(weight.x * n.x * n.y))%>%
    spread(key = 'class.x', value = 'co_occur', fill = 0L)%>%
    column_to_rownames('class.y')%>%
    as.matrix()

  diag(mat_ans_2) <- 1L

data.table - slower on this dataset

dt <- as.data.table(DF)[, .N, by = .(ID, class, weight)]

dt2 <- dt[dt, on = 'ID', .(class, i.class, weight, N, i.N), by = .EACHI, allow.cartesian = T
          ][class != i.class, .(co_occur = sum(weight * N * i.N)), by = .(class, i.class)]

dt3 <- dcast(dt2, class ~ i.class, fill = 0, value.var = 'co_occur')

mat_ans <- as.matrix(dt3[,-1])

rownames(mat_ans) = colnames(mat_ans)
diag(mat_ans) <- 1L

And here's a bonus way using xtabs

dt <- setkey(as.data.table(DF)[, .N, by = .(ID, class, weight)], ID)

dt_mat <- xtabs(co_occur ~ i.class + class,
                data = dt[dt, .(class, i.class, co_occur = weight*N*i.N), allow.cartesian = T]
                )

diag(dt_mat) <- 1L

Performance:

Unit: milliseconds
        expr     min        lq       mean    median       uq      max neval
     cole_dt  9.7538  10.36345  10.966212  10.84040  11.1854  15.8167   100
   cole_tidy  5.5976   5.79765   6.221044   5.96675   6.1522  10.0465   100
  cole_xtabs  6.2134   6.65480   7.062921   6.94780   7.2503  13.9981   100
 marian_tidy 95.9504 100.08345 103.244376 101.95380 104.7970 125.7495   100

Data:

DF <- structure(list(ID = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L, 6L, 6L, 7L, 7L, 7L, 7L, 8L, 8L, 9L, 9L, 9L, 10L, 10L, 10L, 10L, 11L, 11L, 11L, 12L, 12L, 12L, 13L, 13L, 13L, 13L, 14L, 14L, 15L, 15L, 15L, 15L, 15L, 16L, 16L, 16L, 16L, 17L, 17L, 17L, 18L, 18L, 18L, 19L, 19L, 20L, 20L, 20L, 20L)
                     , class = c("A", "A", "B", "A", "B", "C", "B", "B", "C", "C", "D", "D", "A", "B", "C", "D", "B", "B", "C", "C", "D", "D", "C", "D", "D", "E", "E", "C", "C", "C", "E", "A", "A", "A", "A", "B", "C", "B", "D", "D", "E", "D", "E", "B", "B", "C", "C", "D", "C", "D", "E", "E", "B", "C", "C", "D", "D", "E", "E", "E", "B", "D", "E", "E")
                     , weight = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0.9, 0.9, 0.9, 0.9, 0.8, 0.8, 0.7, 0.7, 0.7, 0.7, 0.6, 0.6, 0.5, 0.5, 0.5, 0.4, 0.4, 0.4, 0.4, 0.3, 0.3, 0.3, 0.2, 0.2, 0.2, 0.1, 0.1, 0.1, 0.1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1))
                , row.names = c(NA, -64L)
                , class = c("data.table", "data.frame")
)
0
votes

I've tried shorting Marian's solution but have only gotten through the first two parts. It uses data.table which it appears you are already using.

dt <- data[, `:=` (Count = .N), by = list(ID, class)] %>%
  dcast(., ID + weight ~ class, value.var = "Count")

eg.dt <- merge(unique(data$class), unique(data$class), all = TRUE) %>%
  setnames(., c("x", "y"), c("Var1", "Var2"))

Not a big reduction in the code. If I come up with more I'll update.