I am trying to write a decently complex iterative matching function but I am drowning in ifelse and for functions that do not work. Unfortunately I don't have anyone to bounce ideas off so any support or thoughts are appreciated.
My Data Structure
Each row of my data is an observation with many variables, pertinent ones are included in this example. The observation has an assigned Sample_Name
, a Matching_Group
corresponding to the sample name, measurements of Time
, and a subjective Assigned_idx
which is partially complete from earlier parts in the data cleaning. Each observed Sample_Name
can contain 0-7 observations, but the Matching_Group
will always contain 7 observations.
structure(list(Sample_Name = c("A", "A", "A", "A", "A", "B", "B", "B",
"B", "B", "B", "QQ", "QQ", "QQ", "QQ", "QQ", "QQ", "QQ", "SS",
"SS", "SS", "SS", "SS", "SS", "SS"), Matching_Group = c("QQ",
"QQ", "QQ", "QQ", "QQ", "SS", "SS", "SS", "SS", "SS", "SS", "QQ",
"QQ", "QQ", "QQ", "QQ", "QQ", "QQ", "SS", "SS", "SS", "SS", "SS",
"SS", "SS"), Time = c(1, 1.1, 1.2, 1.4, 1.6, 7.203, 7.395,
7.5, 7.6, 7.7, 7.802, 1, 1.102, 1.2, 1.3, 1.398, 1.501, 1.6,
7.2, 7.3, 7.4, 7.5, 7.6, 7.7, 7.8), Assigned_idx = c(NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, 1, 2, 3, 4, 5, 6, 7, 1, 2,
3, 4, 5, 6, 7)), row.names = c(NA, -25L), class = c("tbl_df",
"tbl", "data.frame"))
Sample_Name Matching_Group Time Assigned_idx
A QQ 1.000
A QQ 1.100
A QQ 1.200
A QQ 1.400
A QQ 1.600
B SS 7.203
B SS 7.395
B SS 7.500
B SS 7.600
B SS 7.700
B SS 7.802
QQ QQ 1.000 1
QQ QQ 1.102 2
QQ QQ 1.200 3
QQ QQ 1.300 4
QQ QQ 1.398 5
QQ QQ 1.501 6
QQ QQ 1.600 7
SS SS 7.200 1
SS SS 7.300 2
SS SS 7.400 3
SS SS 7.500 4
SS SS 7.600 5
SS SS 7.700 6
SS SS 7.800 7
My Problem
For each observation (row), I want to calculate a ratio of Time
between each row of the corresponding Matching_Group
. Each Matching_Group
will have an assigned unique Time_Ratio
value that the calculation needs to be equal to +/- some tolerance. If that calculated ratio matches the pre-defined ratio specific to the group, I want to extract and assign the Assigned_idx
from the row belonging to the observations of the Matching_Group
and assign it to the observation. If not, repeat calculation with the same observed Time
and the Time
in the next row of Matching_Group
. Repeat until every observation has a value in Assigned_idx
.
Example: In this data set, for both Matching_Group
the Time_Ratio
should be equal to 1.000 +/- 0.0020
. In my real data set, there will be unique Time_Ratio
values per Matching_Group
designated in a separate table. So for Row 3 with Time
= 1.200
, the Matching_Group
is QQ
. When we calculate the ratio with the first QQ
observed time, 1.200/1.000 = 1.200
which is outside of our defined tolerance --> next observed time of QQ
. 1.200/1.102 = 1.089
...again outside of our tolerance. Finally though, 1.200/1.200 = 1.000
which indeed falls within our designated tolerance for this Matching_Group
. In the row of the observation for Matching_Group
that has the matching ratio, the Assigned_idx
column holds 3
. We take this value, and map it into the Assigned_idx
column for Row 3. Then repeat this for Row 4 and iterate the process.
Desired outcome:
Sample_Name Matching_Group Time Assigned_idx Time_Ratio (Sample:Matching)
A QQ 1.000 1 1.0000
A QQ 1.100 2 0.9982
A QQ 1.200 3 1.0000
A QQ 1.400 5 1.0014
A QQ 1.600 7 1.0000
B SS 7.203 1 1.0004
B SS 7.395 3 0.9993
B SS 7.500 4 1.0000
B SS 7.600 5 1.0000
B SS 7.700 6 1.0000
B SS 7.802 7 1.0003
QQ QQ 1.000 1 1.0000
QQ QQ 1.102 2 1.0000
QQ QQ 1.200 3 1.0000
QQ QQ 1.300 4 1.0000
QQ QQ 1.398 5 1.0000
QQ QQ 1.501 6 1.0000
QQ QQ 1.600 7 1.0000
SS SS 7.200 1 1.0000
SS SS 7.300 2 1.0000
SS SS 7.400 3 1.0000
SS SS 7.500 4 1.0000
SS SS 7.600 5 1.0000
SS SS 7.700 6 1.0000
SS SS 7.800 7 1.0000
I have tried approaching this using dplyr as I figured it should be able to handle what I am trying to accomplish (perhaps purrr is better suited?). Unfortunately I just can't seem to sequence my conditions and my expressions appropriately within ifelse and for functions. My attempts have included a mishmash of grouping %>% mutate with the ratio calculation, data.table::shift, etc. but I just can't seem to get it to work with my condition parameters. Also in case it is pertinent, in my real data there will be ~50 "Names" and ~25 matching groups. I will have a second data source listing the matching group names and respective ratio but did not include such detail in this example.
I'm honestly stumped, any thoughts are appreciated.
case_when()
from thedplyr
package - GordonShumwayVariable
values? Does "observation group's name" refer toName
, and "another column that designates a different group to match with" refer toRelative Group
? Can you give an example of a ratio match failure and then "extract and assign" happening based on a subsequent row? In general, providing example-based clarifications along the way will make it a lot easier to understand your problem. - andrew_reececase_when()
later today and see if I can make progress. - Bill Z