Complex ID Assignment with conditions, iterative calculations, and tolerance matching

Question

I am trying to write a decently complex iterative matching function but I am drowning in ifelse and for functions that do not work. Unfortunately I don't have anyone to bounce ideas off so any support or thoughts are appreciated.

My Data Structure

Each row of my data is an observation with many variables, pertinent ones are included in this example. The observation has an assigned Sample_Name, a Matching_Group corresponding to the sample name, measurements of Time, and a subjective Assigned_idx which is partially complete from earlier parts in the data cleaning. Each observed Sample_Name can contain 0-7 observations, but the Matching_Group will always contain 7 observations.

structure(list(Sample_Name = c("A", "A", "A", "A", "A", "B", "B", "B", 
"B", "B", "B", "QQ", "QQ", "QQ", "QQ", "QQ", "QQ", "QQ", "SS", 
"SS", "SS", "SS", "SS", "SS", "SS"), Matching_Group = c("QQ", 
"QQ", "QQ", "QQ", "QQ", "SS", "SS", "SS", "SS", "SS", "SS", "QQ", 
"QQ", "QQ", "QQ", "QQ", "QQ", "QQ", "SS", "SS", "SS", "SS", "SS", 
"SS", "SS"), Time = c(1, 1.1, 1.2, 1.4, 1.6, 7.203, 7.395, 
7.5, 7.6, 7.7, 7.802, 1, 1.102, 1.2, 1.3, 1.398, 1.501, 1.6, 
7.2, 7.3, 7.4, 7.5, 7.6, 7.7, 7.8), Assigned_idx = c(NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, 1, 2, 3, 4, 5, 6, 7, 1, 2, 
3, 4, 5, 6, 7)), row.names = c(NA, -25L), class = c("tbl_df", 
"tbl", "data.frame"))


Sample_Name Matching_Group  Time        Assigned_idx

A           QQ              1.000   
A           QQ              1.100   
A           QQ              1.200   
A           QQ              1.400   
A           QQ              1.600   
B           SS              7.203   
B           SS              7.395   
B           SS              7.500   
B           SS              7.600   
B           SS              7.700   
B           SS              7.802   
QQ          QQ              1.000       1
QQ          QQ              1.102       2
QQ          QQ              1.200       3
QQ          QQ              1.300       4
QQ          QQ              1.398       5
QQ          QQ              1.501       6
QQ          QQ              1.600       7
SS          SS              7.200       1
SS          SS              7.300       2
SS          SS              7.400       3
SS          SS              7.500       4
SS          SS              7.600       5
SS          SS              7.700       6
SS          SS              7.800       7

My Problem

For each observation (row), I want to calculate a ratio of Time between each row of the corresponding Matching_Group. Each Matching_Group will have an assigned unique Time_Ratio value that the calculation needs to be equal to +/- some tolerance. If that calculated ratio matches the pre-defined ratio specific to the group, I want to extract and assign the Assigned_idx from the row belonging to the observations of the Matching_Group and assign it to the observation. If not, repeat calculation with the same observed Time and the Time in the next row of Matching_Group. Repeat until every observation has a value in Assigned_idx.

Example: In this data set, for both Matching_Group the Time_Ratio should be equal to 1.000 +/- 0.0020. In my real data set, there will be unique Time_Ratio values per Matching_Group designated in a separate table. So for Row 3 with Time = 1.200, the Matching_Group is QQ. When we calculate the ratio with the first QQ observed time, 1.200/1.000 = 1.200 which is outside of our defined tolerance --> next observed time of QQ. 1.200/1.102 = 1.089...again outside of our tolerance. Finally though, 1.200/1.200 = 1.000 which indeed falls within our designated tolerance for this Matching_Group. In the row of the observation for Matching_Group that has the matching ratio, the Assigned_idx column holds 3. We take this value, and map it into the Assigned_idx column for Row 3. Then repeat this for Row 4 and iterate the process.

Desired outcome:

Sample_Name Matching_Group  Time        Assigned_idx    Time_Ratio (Sample:Matching) 

A           QQ              1.000       1               1.0000
A           QQ              1.100       2               0.9982
A           QQ              1.200       3               1.0000
A           QQ              1.400       5               1.0014
A           QQ              1.600       7               1.0000
B           SS              7.203       1               1.0004
B           SS              7.395       3               0.9993
B           SS              7.500       4               1.0000
B           SS              7.600       5               1.0000
B           SS              7.700       6               1.0000
B           SS              7.802       7               1.0003
QQ          QQ              1.000       1               1.0000
QQ          QQ              1.102       2               1.0000
QQ          QQ              1.200       3               1.0000
QQ          QQ              1.300       4               1.0000
QQ          QQ              1.398       5               1.0000
QQ          QQ              1.501       6               1.0000
QQ          QQ              1.600       7               1.0000
SS          SS              7.200       1               1.0000
SS          SS              7.300       2               1.0000
SS          SS              7.400       3               1.0000
SS          SS              7.500       4               1.0000
SS          SS              7.600       5               1.0000
SS          SS              7.700       6               1.0000
SS          SS              7.800       7               1.0000

I have tried approaching this using dplyr as I figured it should be able to handle what I am trying to accomplish (perhaps purrr is better suited?). Unfortunately I just can't seem to sequence my conditions and my expressions appropriately within ifelse and for functions. My attempts have included a mishmash of grouping %>% mutate with the ratio calculation, data.table::shift, etc. but I just can't seem to get it to work with my condition parameters. Also in case it is pertinent, in my real data there will be ~50 "Names" and ~25 matching groups. I will have a second data source listing the matching group names and respective ratio but did not include such detail in this example.

I'm honestly stumped, any thoughts are appreciated.

It would help if you can add examples or nomenclature from your data when you're describing your problem. For example, is ratio a ratio between Variable values? Does "observation group's name" refer to Name, and "another column that designates a different group to match with" refer to Relative Group? Can you give an example of a ratio match failure and then "extract and assign" happening based on a subsequent row? In general, providing example-based clarifications along the way will make it a lot easier to understand your problem. — andrew_reece
@GordonShumway Thank you, I will read up on case_when() later today and see if I can make progress. — Bill Z
@andrew_reece Thanks for the helpful feedback, I have updated my post based on your comments and hopefully things are much clearer. — Bill Z

DavidArndt DavidArndt · Accepted Answer · 2018-07-19T21:33:27

Something like this should work:

#!/usr/bin/R

a = structure(list(Sample_Name = c("A", "A", "A", "A", "A", "B", "B", "B", 
"B", "B", "B", "QQ", "QQ", "QQ", "QQ", "QQ", "QQ", "QQ", "SS", 
"SS", "SS", "SS", "SS", "SS", "SS"), Matching_Group = c("QQ", 
"QQ", "QQ", "QQ", "QQ", "SS", "SS", "SS", "SS", "SS", "SS", "QQ", 
"QQ", "QQ", "QQ", "QQ", "QQ", "QQ", "SS", "SS", "SS", "SS", "SS", 
"SS", "SS"), Time = c(1, 1.1, 1.2, 1.4, 1.6, 7.203, 7.395, 
7.5, 7.6, 7.7, 7.802, 1, 1.102, 1.2, 1.3, 1.398, 1.501, 1.6, 
7.2, 7.3, 7.4, 7.5, 7.6, 7.7, 7.8), Assigned_idx = c(NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, 1, 2, 3, 4, 5, 6, 7, 1, 2, 
3, 4, 5, 6, 7)), row.names = c(NA, -25L), class = c("tbl_df", 
"tbl", "data.frame"));

tol = 0.002;

a$Time_Ratio <- NA;

for (i in 1:nrow(a)) {
  s_name <- a[i, "Sample_Name"];
  mg     <- a[i, "Matching_Group"];
  s_time <- a[i, "Time"];

  for (j in 1:nrow(a)) {
    mg_name <- a[j, "Sample_Name"];
    if (mg_name == mg) {
      mg_time <- a[j, "Time"];
      time_ratio = s_time/mg_time;
      if (abs(time_ratio - 1.0) < tol) {
        a[i, "Assigned_idx"] <- a[j, "Assigned_idx"];
        a[i, "Time_Ratio"] <- time_ratio;
        break;
      }
    }
  }
}

print(a);

Complex ID Assignment with conditions, iterative calculations, and tolerance matching

My Data Structure

My Problem

2 Answers