2
votes

I have a data frame in which several data sources are merged. This creates rows with the same id. Now I want to define which values from which row should be kept.

So far I have been using dplyr with group_by and summarise_all to keep the first value that is not NA.

Here's an example:

library(dplyr)

# function f for summarizing: first non-NA value, or NA if there is none
f <- function(x) {
  x <- na.omit(x)
  if (length(x) > 0) first(x) else NA
}
# test data
test <- data.frame(id = c(1,2,1,2), value1 = c("a",NA,"b","c"), value2 = 0:3)

  id value1 value2
  1      a      0
  2   <NA>      1
  1      b      2
  2      c      3

Merging the rows by id gives the following result:

test <- test %>% group_by(id) %>% summarise_all(funs(f))
id value1 value2
1 a           0
2 c           1

Now the question: replacing NA (via na.omit) already works, but how can I specify that a numerical value of 0 should be skipped as well, so that the first non-zero value is kept? The expected result looks like this:

id value1 value2
1 a           2
2 c           1

3 Answers

1
votes

You can just modify your f function so that it also keeps only the elements that differ from zero:

f <- function(x) {
  x <- na.omit(x)
  x <- x[x != 0]
  if (length(x) > 0) first(x) else NA
}

Sidenote: as of dplyr 0.8.0, funs is deprecated. You should use a lambda, a list of functions, or a list of lambdas instead. In this case I used a single lambda:

test %>%
  group_by(id) %>%
  summarise_all(~f(.))

# A tibble: 2 x 3
     id value1 value2
  <dbl> <chr>   <int>
1     1 a           2
2     2 c           1
1
votes

You can write the f function as:

library(dplyr)

f <- function(x) x[!is.na(x) & x != 0][1]

test %>% group_by(id) %>% summarise(across(everything(), f))

#     id value1 value2
#  <dbl> <chr>   <int>
#1     1 a           2
#2     2 c           1

Using [1] returns NA automatically if a group has no value that is both non-NA and non-zero.
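
To make the [1] behaviour concrete, here is a small base R sketch. The name first_nonzero is only illustrative (it is not from the answer); the body is the same expression as f above.

```r
# Illustrative name; same body as f in this answer.
first_nonzero <- function(x) x[!is.na(x) & x != 0][1]

# A non-NA, non-zero value exists: the first one is returned.
r1 <- first_nonzero(c(0L, NA, 5L, 7L))

# Nothing survives the filter: indexing an empty vector with [1]
# yields an NA of the same type as the input.
r2 <- first_nonzero(c(0L, NA, 0L))

# Character input: "b" is compared with "0" after coercion, so it is kept.
r3 <- first_nonzero(c(NA, "b", "c"))
```

Because the NA comes from indexing the (empty) filtered vector, it keeps the column's type, which avoids mixing a logical NA with integer or character results.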

0
votes

As a sidenote to @RicS's sidenote: as of dplyr 1.0.0, summarise_all() is superseded (it still works, but is no longer the recommended approach). You should use across() instead:

test %>%
  group_by(id) %>%
  summarise(across(everything(), .fns = f))
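
If different columns need different rules, across() can be called more than once inside summarise(). The following is only a sketch, assuming dplyr 1.0.0 or later; the helper names first_non_na and first_non_zero are made up for illustration, and the test data is the one from the question with value2 = 0:3.

```r
library(dplyr)

# Test data from the question (value2 has one entry per row).
test <- data.frame(
  id     = c(1, 2, 1, 2),
  value1 = c("a", NA, "b", "c"),
  value2 = 0:3,
  stringsAsFactors = FALSE
)

# Hypothetical helpers: one rule per column type.
first_non_na   <- function(x) x[!is.na(x)][1]
first_non_zero <- function(x) x[!is.na(x) & x != 0][1]

result <- test %>%
  group_by(id) %>%
  summarise(
    # character columns: skip NA only
    across(where(is.character), first_non_na),
    # numeric columns: skip NA and 0 (the grouping column id is excluded by across)
    across(where(is.numeric), first_non_zero)
  )
```

Here value1 keeps "a" and "c", while value2 keeps 2 and 1, matching the expected result in the question.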