2
votes

I have two large data frames, which I have associated using a "1" to indicate a match between them. For example:

Name<-c("N1", "N2", "N3")
A1<-c(1, NA, 1)
A2<-c(NA, 1, NA)
A3<-c(NA, 1, 1)
df1<-data.frame(Name, A1, A2, A3)

> df1
  Name A1 A2 A3
1   N1  1 NA NA
2   N2 NA  1  1
3   N3  1 NA  1

Var<-c("A1", "A2", "A3")
Detail<-c("Red", "Red, Blue", "Green, Red")
df2<-data.frame(Var, Detail)

> df2
  Var     Detail
1  A1        Red
2  A2  Red, Blue
3  A3 Green, Red

I am trying to aggregate all details associated with each variable from df2 with each of the names marked with a "1" for that particular variable from df1. In the end, I hope to have all unique entries of "Detail" as individual columns with 1 if a name is associated with them. df3 would be an example of the desired result.

Red<-c(1,1,1)
Blue<-c(0,1,0)
Green<-c(0,1,1)
df3<-data.frame(Name, Red, Blue, Green)

> df3
  Name Red Blue Green
1   N1   1    0     0
2   N2   1    1     1
3   N3   1    0     1

After going through similar questions and trying for a LONG time, this is what I was trying to do:

I replaced cells with "1" in df1 with column name so that they are able to be matched with the corresponding rows in df2

ones <- which(df1=="1", arr.ind=T)
df1[ones]<-colnames(df1)[ones[,2]]
df1;

I know it is preferable to avoid loops, but i can't think of another way. I am trying to cycle through the columns and rows to compare them with the Var names from df2 and copy the corresponding details into that cell back into df1 using "which". But I keep getting error messages...

for(i in 2:ncol(df1)){
  for(j in 1:nrow(df1)){
  + tmp <- df2[which(df2[,1]==df1[j,i]),]
  + df1[j,i] <- tmp[which(tmp[j,]==df1[j,i]),4]
    }
  }

If I try to replace it individually I don't get the desired result either

tmp <- df2[which(df2[,1]==df1[1,2]),]
df1[1,2] <- tmp[which(tmp[1,1]==df1[1,2]),2]

And this is where I am stuck. After this step I would aggregate all details into a single comma separated string in a single cell, make a list of all unique Details and turned them each into a column, then made another loop to add "1"s to each instance where a part of the string with the the details matches the new column headers.

There is probably a simpler way that avoids half (or all of) these steps, apologies if this is a basic question. Any advice is greatly appreciated!

2

2 Answers

4
votes

Using the tidyverse suite of packages.

library(tidyverse)
Name<-c("N1", "N2", "N3")
A1<-c(1, NA, 1)
A2<-c(NA, 1, NA)
A3<-c(NA, 1, 1)
df1<-data.frame(Name, A1, A2, A3)

Var<-c("A1", "A2", "A3")
Detail<-c("Red", "Red, Blue", "Green, Red")
df2<-data.frame(Var, Detail)

# The pipe operator (%>%) makes df1 the first argument of the next function.
# It lets us look at this "in order" not nested
df3 <- df1 %>%  
  gather(Var, value, -Name) %>%
  merge(df2) %>%
  filter(!is.na(value)) %>%
  separate(Detail, into = paste0("color",1:2), fill = "right") %>%
  gather(value, color, color1:color2) %>%
  select(-value) %>%
  filter(!is.na(color)) %>%
  mutate(
    one = 1L
  ) %>%
  spread(color, one, fill = 0) %>%
  group_by(Name) %>%
  summarize_at(vars(Red, Blue, Green),
    ~ as.integer(sum(.) > 0))

If you aren't familiar with the pipe operator or the other functions I'm using, I would call this one piece at a time so you can see what it's doing (and it's all documented at https://www.tidyverse.org/). For example,

df1 %>%
 gather(Var, value, -Name)

That particular piece is a lot like reshape(df1, direction = 'long',...), but it's a lot simpler for me to remember. If your full data set has more than 2 colors, you need to make adjustments to the paste0 and gather commands. I'm not sure if there is an equivalent to "separate" that will add as many numbers as needed for all the colors, but so long as you know, it shouldn't be an issue. If not, you can get it by doing

num_colors <- max(sapply(strsplit(as.character(df2$Detail), ", "), length))

or (to demonstrate how much pipes improve legibility):

num_colors <- df2$Detail %>%
  as.character() %>%
  str_split(", ") %>% # almost the same as strsplit, but within a tidyverse package
  sapply(length) %>%
  max()
2
votes

It is similar to Melissa Key's answer. With the help of cSplit function from the package, we do not have to know how many colors are in the Detail column we need to split.

library(tidyverse)
library(splitstackshape)

df3 <- df1 %>%
  gather(Var, value, -Name) %>%
  left_join(df2, by = "Var") %>%
  drop_na(value) %>%
  cSplit(splitCols = "Detail", sep = ",") %>%
  gather(Detail, Color, starts_with("Detail"), na.rm = TRUE) %>%
  group_by(Name, Color) %>%
  summarise(value = max(value)) %>%
  spread(Color, value, fill = 0) %>%
  ungroup()

df3
# # A tibble: 3 x 4
#   Name   Blue Green   Red
#   <fct> <dbl> <dbl> <dbl>
# 1 N1       0.    0.    1.
# 2 N2       1.    1.    1.
# 3 N3       0.    1.    1.