summarize_all rows by grouping and define which value should be kept

Question

I have a data frame in which several data sources are merged. This creates rows with the same id. Now I want to define which values from which row should be kept.

So far I have been using dplyr with group_by and summarise_all to keep the first value that is not NA.

Here's an example:

library(dplyr)

# function f for summarizing: first non-NA value, or NA if there is none
f <- function(x) {
  x <- na.omit(x)
  if (length(x) > 0) first(x) else NA
}
# test data
test <- data.frame(id = c(1,2,1,2), value1 = c("a",NA,"b","c"), value2 = c(0:3))

  id value1 value2
  1      a      0
  2   <NA>      1
  1      b      2
  2      c      3

The following result is obtained when merging

test <- test %>% group_by(id) %>% summarise_all(funs(f))
id value1 value2
1 a           0
2 c           1

Now the question: dropping the NA values (via na.omit) already works, but how can I specify that a value not equal to 0 should be kept in preference to the numerical value 0? The expected result looks like this:

id value1 value2
1 a           2
2 c           1

Ric S · Accepted Answer · 2021-05-17T12:49:18

You can modify your f function so that it also drops the elements equal to zero before taking the first one:

f <- function(x) {
  x <- na.omit(x)
  x <- x[x != 0]
  if (length(x) > 0) first(x) else NA
}
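
To see what the extra filter does, a quick check on plain vectors may help. This is only a sketch: the variable names nums, chars and zeros are made up for illustration, and f is repeated so the snippet runs on its own. Note that a group containing only zeros (or only NA) ends up as NA, because nothing survives the filtering.

```r
library(dplyr)  # provides first()

f <- function(x) {
  x <- na.omit(x)     # drop missing values
  x <- x[x != 0]      # drop zeros
  if (length(x) > 0) first(x) else NA
}

nums  <- f(c(0, 2))      # 0 is dropped, so 2 is kept
chars <- f(c(NA, "c"))   # NA is dropped, so "c" is kept
zeros <- f(c(0, 0))      # nothing is left, so the result is NA
```

If a zero should still be returned when it is the only value available, f would need an extra fallback branch; the version above does not do that.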

Sidenote: as of dplyr 0.8.0, funs is deprecated. You should use a lambda, a list of functions or a list of lambdas instead. In this case I used a single lambda:

test %>%
  group_by(id) %>%
  summarise_all(~f(.))

# A tibble: 2 x 3
     id value1 value2
  <dbl> <chr>   <int>
1     1 a           2
2     2 c           1
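
In dplyr 1.0.0 and later, the scoped verbs such as summarise_all are superseded by across(). A sketch of the same summary in that style, assuming dplyr >= 1.0.0 is installed; the test data and f are repeated (with value2 taken as 0:3 to match the four rows printed in the question) so the snippet runs on its own, and the name result is just for illustration:

```r
library(dplyr)

# test data as printed in the question (4 rows)
test <- data.frame(id = c(1, 2, 1, 2),
                   value1 = c("a", NA, "b", "c"),
                   value2 = 0:3)

# first value that is neither NA nor 0, or NA if there is none
f <- function(x) {
  x <- na.omit(x)
  x <- x[x != 0]
  if (length(x) > 0) first(x) else NA
}

# everything() selects all non-grouping columns, so id is left alone
result <- test %>%
  group_by(id) %>%
  summarise(across(everything(), f))
```

This should give the same two rows as the summarise_all call, one per id.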
