R: Extract Rows from One Data Frame, Based on Column Names Matching Values from Another Data Frame

Question

I would like to know how to extract the values in one column of a data frame (data frame A) based on certain column names in data frame A containing the values of multiple columns from another data frame (data frame B).

More specifically. I have two data frames:

Data frame A contains combinations of birth defects. Each row is a different combination, and each column is the number of a defect included in that combination.

# Combinations data frame 
combos <- data.frame("combo_no"=c(1:4),
                     "Defect_A" = c(1,1,1,1),
                     "Defect_B" = c(3,2,3,4),
                     "Defect_C" = c(4,4,NA,7),
                     "Defect_D" = c(5,5,NA,8),
                     "Defect_E" = c(6,6,NA,NA))

Data frame B contains individual cases. The first column has a unique identifier (CASE_ID). The rest of the columns are the numbers of specific birth defects, with “1” for “birth defect present” and “0” for “not present.”

# Cases data set 
set.seed(99)
CASE_ID = c(1001:1005)
case1 = sample(0:1, 10, replace=TRUE)  
case2 = sample(0:1, 10, replace=TRUE)  
case3 = sample(0:1, 10, replace=TRUE)  
case4 = sample(0:1, 10, replace=TRUE)  
case5 = sample(0:1, 10, replace=TRUE)  
def<-data.frame(rbind(case1, case2, case3, case4, case5))
colnames(def)<- c(1:10)
cases<-cbind(CASE_ID,def)

Desired Output: I would like to obtain the list of CASE_IDs from data frame A that have a combination of birth defects from data frame B. I’d also like to specify which combination is present. Ideally, the output would look as follows:

# Desired Output
output <- data.frame("CASE_ID" = c(1002,1003),
                     "combo_no" = c(3,1))

Thank you for your help.

do you mean that each row of the first DB contains the info about which combination of numbers of the second DB compose the combination? As example, "combo_no: 1" is composed of child with 1,3,4,5,6 in the second table? — cccnrc
yes, combo_no:1 is composed of defects 1, 3, 4, 5, and 6. Child 1 (case_ID: 10001) has this combo. In the full data set, multiple case ids could have this combo. — Lulis Navarro

cccnrc cccnrc · Accepted Answer · 2019-06-17T18:32:37

Here the solution, long in order to comment it step by step:

### my random generated cases DF:
cases
      CASE_ID 1 2 3 4 5 6 7 8 9 10
case1    1001 1 0 1 1 1 1 1 0 0  0
case2    1002 1 1 0 1 1 1 0 0 0  0
case3    1003 0 0 1 1 1 0 0 1 0  0
case4    1004 1 0 0 1 0 0 1 1 1  1
case5    1005 1 0 1 1 0 1 0 0 1  0

### initialize vectors to store found results
found_combos <- vector(); found_patients <- vector();
### open loop on combos rows
for (i in 1:nrow(combos)) {
  ### open empty vector to fill with the numbers that compose the defect
  defect_numbers <- vector()
  ### open loop on column and take the numbers
  for (col in colnames(combos)[2:length(colnames(combos))]) {
    number <- combos[i, col]
    if ( !is.na(number) ) defect_numbers <- append(defect_numbers, number)
  }
  ### sort the vector to avoid mismatch based on order
  defect_numbers <- sort( defect_numbers )
  ### open loop on patients table
  for ( pz in 1:nrow(cases) ) {
    pz_numbers <- sort( which( cases[pz,] == 1 )-1 )
    ### first condition: same length
    if ( length(pz_numbers) == length(defect_numbers) ) {
      ### same condition: exacly same numbers
      if (all(pz_numbers == defect_numbers)) {
        ### append to found results vectors
        found_patients <- append( found_patients, cases[pz,1] )
        found_combos <- append( found_combos, i )
      }
    }
  }
}

output <- data.frame("CASE_ID" = found_patients,
                     "combo_no" = found_combos)

### result:
output
  CASE_ID combo_no
1    1002        2

EDITED based on your comment:

just change the conditions from equals to %in% :

### initialize vectors to store found results
found_combos <- vector(); found_patients <- vector();
for (i in 1:nrow(combos)) {
  ### open empty vector to fill with the numbers that compose the defect
  defect_numbers <- vector()
  ### open loop on column and take the numbers
  for (col in colnames(combos)[2:length(colnames(combos))]) {
    number <- combos[i, col]
    if ( !is.na(number) ) defect_numbers <- append(defect_numbers, number)
  }
  ### sort the vector to avoid mismatch based on order
  defect_numbers <- sort( defect_numbers )
  ### open loop on patients table
  for ( pz in 1:nrow(cases) ) {
    pz_numbers <- sort( which( cases[pz,] == 1 )-1 )
    ### only condition: all defect_numbers in combo_numbers vector
    if (all(defect_numbers %in% pz_numbers)) {
      ### append to found results vectors
      found_patients <- append( found_patients, cases[pz,1] )
      found_combos <- append( found_combos, i )
    }
  }
}

output <- data.frame("CASE_ID" = found_patients,
                     "combo_no" = found_combos)

R: Extract Rows from One Data Frame, Based on Column Names Matching Values from Another Data Frame

1 Answers