3
votes

I am trying to use sum function inside dplyr's mutate function. However I am ending up with unexpected results. Below is the code to reproduce the problem

chk1 <- data.frame(ba_mat_x=c(1,2,3,4),ba_mat_y=c(NA,2,NA,5))

I used the below code to create another column that sums up the above 2 columns

chk2 <- chk1 %>% dplyr::mutate(ba_mat=sum(ba_mat_x+ba_mat_y,na.rm = T))

I had used na.rm=T because I have NAs in variable ba_mat_y. The result I got is as below

        ba_mat_x ba_mat_y ba_mat
   1        1       NA     13
   2        2        2     13
   3        3       NA     13
   4        4        5     13

However, the expected result is

      ba_mat_x ba_mat_y ba_mat
1        1       NA     1
2        2        2     4
3        3       NA     3
4        4        5     9
3

3 Answers

1
votes

If we have multiple columns and want to sum only limited columns we can replace them with 0 and then add the columns

library(dplyr)

chk1 %>%
  mutate_at(vars(ba_mat_x,ba_mat_y), ~ replace(., is.na(.), 0)) %>%
  mutate(ba_mat = ba_mat_x + ba_mat_y)

#  ba_mat_x ba_mat_y ba_mat
#1        1        0      1
#2        2        2      4
#3        3        0      3
#4        4        5      9

We can also use replace_na from tidyr which does the same thing.

chk1 %>%
  mutate_at(vars(ba_mat_x, ba_mat_y), tidyr::replace_na, 0) %>%
  mutate(ba_mat = ba_mat_x + ba_mat_y)

If we want to use sum other option is to use purrr's pmap or pmap_dbl where we can now pass list of columns to add upon and then use sum.

chk1 %>%
  mutate(ba_mat = purrr::pmap_dbl(list(ba_mat_x, ba_mat_y), sum, na.rm = TRUE)) 

#  ba_mat_x ba_mat_y ba_mat
#1        1       NA      1
#2        2        2      4
#3        3       NA      3
#4        4        5      9

In this particular case, we can also use map2_dbl

chk1 %>%
  mutate(ba_mat = purrr::map2_dbl(ba_mat_x, ba_mat_y, sum, na.rm = TRUE))

This would also work since we have only two columns but in case if there are more columns it is better/safer to use pmap option.


Moreover, as far as your attempt is concerned it would work if you add rowwise to it. rowwise as the name suggests does all the operation in row-wise fashion.

chk1 %>% 
  rowwise() %>%
  mutate(ba_mat = sum(ba_mat_x, ba_mat_y, na.rm = T))

But rowwise is usually slower.

2
votes

You want rowSums, to get the sum of columns for each row.

> chk1 %>% dplyr::mutate(ba_mat = rowSums(., na.rm=T))

  ba_mat_x ba_mat_y ba_mat
1        1       NA      1
2        2        2      4
3        3       NA      3
4        4        5      9

Your other formulation (sum(ba_mat_x+ba_mat_y,na.rm = T))) means:

  • first perform chk1$ba_mat_x + chk1$ba_mat_y: 1 + NA, 2 + 2, 3 + NA, 4 + 5, which results in NA, 4, NA, 9
  • then take the sum(na.rm=T) of that, which is 13
2
votes

We can use rowSums from base R

chk1$ba_mat <- rowSums(chk1, na.rm = TRUE)
chk1
#   ba_mat_x ba_mat_y ba_mat
#1        1       NA      1
#2        2        2      4
#3        3       NA      3
#4        4        5      9

Or using tidverse while keeping the original columns as such

library(tidyverse)
chk1 %>% 
    mutate(ba_mat  = replace(., is.na(.), 0) %>%
                     reduce(`+`))
#    ba_mat_x ba_mat_y ba_mat
#1        1       NA      1
#2        2        2      4
#3        3       NA      3
#4        4        5      9

Or with case_when

chk1 %>% 
    mutate_if(is.numeric, list(new = ~case_when(is.na(.) ~ 0,
                TRUE ~ .))) %>%
    transmute(!!! rlang::syms(names(chk1)), ba_mat = ba_mat_x_new + ba_mat_y_new)
#   ba_mat_x ba_mat_y ba_mat
#1        1       NA      1
#2        2        2      4
#3        3       NA      3
#4        4        5      9