1
votes

I would like to ask the R community how to merge two rows with the same ID (i.e. same participant) with some variables that are identical and others where there are NA's. In my example, I would like all the values 4-5-6 to appear on one row and therefore for the NA's (or empty cells) to be gone.

This is an example of what my data looks like

I have tried using dplyr without much success, and I have to do the merging by hand (which is quite time consuming and increases the risk for errors). Thank you in advance for your help with this problem!

4
Please do not use images. Provide sample input data as R code that can be copied and pasted into R to reproduce the input. I have done this for you this time in the Note at the end of my answer. See minimal reproducible example for more info.G. Grothendieck
Possible duplicate of Merging rows with shared informationjdobres
@G.Grothendieck Okay, thank you for the info.Elmo556

4 Answers

3
votes
# Create sample data frame.
id <- c(rep('Participant 1', 2), rep('Participant 2', 2))
value1 <- rep('A', 4)
value2 <- rep('B', 4)
value3 <- rep('C', 4)
value4 <- c('x', NA, NA, 'x')
value5 <- c('x', NA, 'x', NA)
value6 <- c(NA, 'x', NA, 'x')

df <- data.frame(id, value1, value2, value3, value4, value5, value6, stringsAsFactors = F)

# Use dplyr to group the data and keep the non-NA value from the other columns.
df %>% group_by(id, value1, value2, value3) %>%
       summarise(value4 = max(value4, na.rm = T),
                 value5 = max(value5, na.rm = T),
                 value6 = max(value6, na.rm = T))
2
votes

1) Using DF defined in the Note below try aggregating using the compress function defined below. This function removes NA values and appends an NA just in case all values were removed and then takes the first of what is left. No packages are used.

compress <- function(x) c(na.omit(x), NA)[1]
aggregate(DF[5:7], DF[1:4], compress)

giving:

  ID Value1 Value2 Value3 Value4 Value5 Value6
1  1      A      B      C      x      x      x
2  2      A      B      C      x      x      x

2) A simpler alternative if no participant has all NA values in any column is that we could eliminate the definition of compress and use max with na.rm = TRUE instead like this:

aggregate(DF[5:7], DF[1:4], max, na.rm = TRUE)

Note: The input in reproducible form:

Lines <- "ID Value1 Value2 Value3 Value4 Value5 Value6
1 A B C x x NA
1 A B C NA NA x
2 A B C NA x NA
2 A B C x NA x"
DF <- read.table(text = Lines, header = TRUE, as.is = TRUE)
2
votes

Another solution with dplyr and tidyr:

library(dplyr)
library(tidyr)
DF %>% 
  gather(var, val, Value4:Value6) %>% 
  filter(!is.na(val)) %>% 
  spread(var, val)

using the data of @G.Grothendieck, this results in:

  ID Value1 Value2 Value3 Value4 Value5 Value6
1  1      A      B      C      x      x      x
2  2      A      B      C      x      x      x

Or another variation with summarise_each with the max approach of @G.Grothendieck:

DF %>% 
  group_by(ID, Value1, Value2, Value3) %>%
  summarise_each(funs(max(., na.rm = TRUE)))

The gather and spread options can also be translated into a solution with reshape2:

library(reshape2)
dcast(na.omit(melt(DF, id.vars = c('ID','Value1','Value2','Value3'))),
      ID + Value1 + Value2 + Value3 ~ variable,
      value.var = 'value')
1
votes

If you prefer to use dplyr try:

library(dplyr)
DF %>%
    group_by(ID, Value1, Value2, Value3) %>%
    summarise_each(funs(toString(na.omit(.))))

Result:

     ID Value1 Value2 Value3 Value4 Value5 Value6
  <int>  <chr>  <chr>  <chr>  <chr>  <chr>  <chr>
1     1      A      B      C      x      x      x
2     2      A      B      C      x      x      x

Note: