3
votes

I'm trying to find the total count of how many users in df1 meet conditions specified in df2, but keep getting an error message.

df1 looks like this:

    id  step1          step2
    1   session_start  NA
    2   session_start  NA
    3   session_start  sign_up
    4   session_start  sign_up
    5   session_start  sign_up
    6   sign_up        session_start

df1 <- Classes ‘grouped_df’, ‘tbl_df’, ‘tbl’ and 'data.frame':  6 obs. of  3 variables:
    $ id   : chr  "1" "2" "3" "4" ...
    $ step1: chr  "session_start" "session_start" "session_start" "session_start" ...
    $ step2: chr  NA NA "sign_up" "sign_up" ..

df2 looks like this:

    step1          step2         count
    session_start  sign_up       0
    sign_up        in_screen     0
    in_screen      click_banner  0
    session_stop   session_stop  0

df2 <- structure(c("session_start", "sign_up", "0", "sign_up", 
"in_screen", "0", "in_screen", "click_banner", "0", "session_stop", 
"session_stop", "0", .Dim = c(3L, 4L), .Dimnames = list(c("step1", "step2", 
"count"), NULL))

In the column df2$count, I would like to show how many (sum of) users completed df2$step1 and df2$step2 in that order. In the sample code above, the first row in df2$count would output 3, as 3 users in df1 completed session_start as df1$step1 and sign_up as df1step2.

Previous Attempts

When I try doing this manually using this code, everything works fine:

count <- sum(df1$step1 == "session_start" & df1$step2 == "sign_up", na.rm = TRUE)

However, when I replace "session_start" and "sign_up" with dynamic values, I get the error that "Error in test8$step1 : $ operator is invalid for atomic vectors":

df2$count <- sum(df1$step1 == df2$step1 & df1$step2 == df2$step2, na.rm = TRUE)

I tried replacing "$" with "[]", but still get "Error: Columns session_start, sign_up, in_screen, click_banner, session_stop not found":

df2[count] <- sum(df1[step1] == df2[step1] & df1[step2] == df2[step2], na.rm = TRUE)

Ideal response:

I'd love to be able to add the extra column to the data as shown below. Are you able to help? Many thanks in advance if so!

    step1          step2         count
    session_start  sign_up       3
    sign_up        in_screen     0
    in_screen      click_banner  0
    session_stop   session_stop  0
2

2 Answers

3
votes

Here is a tidyverse solution.

library(tidyverse)

df2 %>%
  group_by(step1, step2) %>%
  mutate(count = sum(step1 == df1$step1 & step2 == df1$step2, na.rm = TRUE))
## A tibble: 4 x 3
## Groups:   step1, step2 [4]
#  step1         step2        count
#  <chr>         <chr>        <int>
#1 session_start sign_up          3
#2 sign_up       in_screen        0
#3 in_screen     click_banner     0
#4 session_stop  session_stop     0

Note that instead of mutate you can also use summarise but the order of the output rows will be different.

3
votes

You could use mapply and calculate number of step1 and step2 values completed in df1.

df2$count <- mapply(function(x, y) 
    sum(df1$step1 == x & df1$step2 == y, na.rm = TRUE), df2$step1, df2$step2)


df2
#          step1        step2 count
#1 session_start      sign_up     3
#2       sign_up    in_screen     0
#3     in_screen click_banner     0
#4  session_stop session_stop     0

data

df1 <- structure(list(id = c("1", "2", "3", "4", "5", "6"), 
step1 = c("session_start", "session_start", "session_start", 
 "session_start", "session_start", 
 "sign_up"), step2 = c(NA, NA, "sign_up", "sign_up", "sign_up", 
"session_start")), .Names = c("id", "step1", "step2"), row.names = c(NA, 
-6L), class = "data.frame")

df2 <- structure(list(step1 = c("session_start", "sign_up", "in_screen", 
"session_stop"), step2 = c("sign_up", "in_screen", "click_banner", 
"session_stop")), .Names = c("step1", "step2"), row.names = c(NA, 
-4L), class = "data.frame")