2
votes

I have a dataframe with duplicated id's in the first column and different values in the subsequent columns. I would like to truncate this data to have only one record for each unique id, and the values in the subsequent columns be the sum of these values. However, I can do this with dplyr::summarise, but if I use na.rm=TRUE, it replaces NA's with 0 (if all the records were NA) or if I use it without na.rm=TRUE, then it sums it to NA (if there was a NA present).

How can I get it to retain the NA as the new value if all the values were NA, and the sum if there were numeric values with an NA.

Apologies for the bad explanation. Not sure how to better word it.

A mock dataframe would look like this:

    df <- structure(
      list(
        id = structure(
          c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 5L, 6L, 7L, 7L), 
          .Label = c("a", "b", "c", "d", "e", "f", "g"), 
          class = "factor"
        ), 
        `1` = c(NA, NA, NA, 1, 1, 0, 1, 1, 0, 1, NA, 1, NA, 0, 1, 0),
        `2` = c(NA, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, NA, 0), 
        `3` = c(NA, 1, 1, 0, 1, 1, 0, 1, 0, 1, NA, 1, 0, 0, NA, NA)
      ), 
      row.names = c(NA, -16L), 
      class = "data.frame"
    )

which would print out looking like this:

> df
   id  1  2  3
1   a NA NA NA
2   a NA  0  1
3   a NA  1  1
4   b  1  0  0
5   b  1  1  1
6   c  0  0  1
7   c  1  1  0
8   c  1  0  1
9   c  0  1  0
10  c  1  1  1
11  c NA  0 NA
12  d  1  1  1
13  e NA  0  0
14  f  0  0  0
15  g  1 NA NA
16  g  0  0 NA

I would like to group by the 'id' column, and then sum it to get something like this:

  id  1 2  3
1  a NA 1  2
2  b  2 1  1
3  c  3 3  3
4  d  1 1  1
5  e NA 0  0
6  f  0 0  0
7  g  1 0 NA

I have tried using summarise with and without na.rm=T but it does not provide what I need.

    df %>% 
      group_by(
        id
      ) %>% 
      summarise_at(
        c(
          1,2,3
        ),
        sum,
        na.rm = T
      ) 

# A tibble: 7 x 4
  id      `1`   `2`   `3`
  <fct> <dbl> <dbl> <dbl>
1 a         0     1     2
2 b         2     1     1
3 c         3     3     3
4 d         1     1     1
5 e         0     0     0
6 f         0     0     0
7 g         1     0     0

Without na.rm = T:

    
    df %>% 
      group_by(
        id
      ) %>% 
      summarise_at(
        c(
          1,2,3
        ),
        sum
      ) 

# A tibble: 7 x 4
  id      `1`   `2`   `3`
  <fct> <dbl> <dbl> <dbl>
1 a        NA    NA    NA
2 b         2     1     1
3 c        NA     3    NA
4 d         1     1     1
5 e        NA     0     0
6 f         0     0     0
7 g         1    NA    NA

I am not sure what else to try. Any advice would be greatly appreciated. Thank you very much

2

2 Answers

1
votes

You can check for the values in each id and if all the values are NA return NA.

library(dplyr)

df %>% 
  group_by(id) %>% 
  summarise(across(`1`:`3`, ~if(all(is.na(.))) NA else sum(., na.rm = TRUE)))
  #summarise_at(vars(`1`:`3`), ~if(all(is.na(.))) NA else sum(., na.rm = TRUE))

#   id      `1`   `2`   `3`
#  <fct> <dbl> <dbl> <dbl>
#1 a        NA     1     2
#2 b         2     1     1
#3 c         3     3     3
#4 d         1     1     1
#5 e        NA     0     0
#6 f         0     0     0
#7 g         1     0    NA
1
votes

We can use

library(dplyr)
df %>%
    group_by(id) %>%
    summarise(across(-id, ~ if(sum(is.na(.)) == n() NA else sum(., na.rm = TRUE)))