How to merge two data frames and fill with different options based on coincidences

Question

I have two data frame y and z

y <- data.frame(ID = c("A", "A", "A", "B", "B"), gene = c("a", "b", "c", "a", "c"))
z <- data.frame(A = c(2,6,3), B = c(8,4,9), C=c(1,6,2))
rownames(z) <- c("a", "b", "c")

So for y I have a table with patients ID and gene for each patients and in Z I have the same patients IDs in the first row and a list of genes with a specific value (which is not important here). The genes in y are in z, but in z there are genes that are not included in y. What I want to do is to merge this frames and have something like this:

       a   b   c
A      1   1   1 
B      1   0   1

So for each patient, if the genes in z are also in y, fill with 1 and if not, fill with 0

I don't really know how to handle this, any ideas? Thank you

Do the A, B, C columns in z have any relation to the result? — Gregor Thomas
When asking a question, write reproducible examples (something we can easily cut and paste in our own R session). You want us to have easy acces to your y and z object. Look at the function dput, it might help. — Bastien
That being said, you'll want to cast or spread your data, merge it, and cast or spread it again. To do so, if you're a tidyverse fan, try tidyr::pivot_wider or tidyr::pivot_longer. If you're more old school, reshape2::dcast or reshape2::melt. — Bastien
Thank you for your comment and suggestion about reproducibility, of course you are totally right and I will definitely consider it next time. Thank you for your time and effort! — lana

Bastien Bastien · Accepted Answer · 2021-03-08T17:01:10

I've made RE from your question (add this to your question next time):

y <- data.frame(ID = c("id_A", "id_A", "id_A", "id_B", "id_B"), gene = c("a", "b", "c", "a", "c"))
z <- data.frame(id_A = c(2,6,3), id_B = c(8,4,9), id_C=c(1,6,2))
rownames(z) <- c("a", "b", "c")

The idea here is to pivot_longer your tables so you can join than easily.

To do so, you first need to make your rownames into a field:

z <- tibble::rownames_to_column(z, "gene")

Then, you pivot longer you z table:

library(tidyr)
z_long <- pivot_longer(z, starts_with("id_"), names_to = "ID")

and join it with your y table:

library(dplyr)
table_join <- left_join(y, z_long)

Finally, you just have to calculate the frequencies:

table(table_join$ID, table_join$gene)
   
       a b c
  id_A 1 1 1
  id_B 1 0 1

How to merge two data frames and fill with different options based on coincidences

1 Answers