1
votes

Good morning all, this is my first time posting on stack overflow. Thank you for any help!

I have 2 dataframes that I am using to analyze stock data. One data frame has dates among other information, we can call it df:

df1 <- tibble(Key = c('a','b','c'), i =11:13, date= ymd(20110101:20110103))

The second dataframe also has dates and other important information.

df2 <-tibble(Answer = c('a','d','e','b','f','c'), j =14:19, date= ymd(20150304:20150309))

Here is what I want to do: For each row in df1, I need to:

-Find the date in df2 where, for when df2$answer is the same as df1$key, it is the closest to that row's date in df1.

-Then extract information for another column in that row in df2, and put it in a new row in df1.

The code i tried:

df1 %>%
 group_by(Key, i) %>%
 mutate(
`New Column` =  df2$j[
  which.min(subset(df2$date, df2$Answer== Key) - date)])

This has the result:

Key       i date       `New Column`
1 a        11 2011-01-01           14
2 b        12 2011-01-02           14
3 c        13 2011-01-03           14

This is correct for the first row, a. In df2, the closest date is 2015-03-04, for which the value of j is in fact 14.

However, for the second row, Key=b, I want df2 to subset to only look at dates for rows where df2$Answer = b. Therefore, the date should be 2015-03-07, for which j =17.

Thank you for your help!

Jesse

1
Rename the date columns so they don't share a name, left_join the data on "Answer" = "Key", arrange by the difference in the date columns, group_by Answer and slice the top row from each group (smallest date difference, because of the sorting).Gregor Thomas
Thanks for your help Gregor! Are you saying to left-join with df2 on left, df1 on right, like left_join(df2, df1, by = c("Answer" = "Key")) ? If I do it the other way it is not accepted since Key only has 3 values and Answer has 6Jesse Kerr
df1 %>% right_join(df2, by = c('Key' = 'Answer')) %>% group_by(Key) %>% arrange(abs(date.x - date.y)) %>% slice(1) Is Gregor's suggestion I believe (Or switch df2 and df1 and change right_join to left_join)IceCreamToucan
Hi Ryan and Gregor. The problem with this suggestion, while helpful, is that in the actual data, the dates are not exactly matching. If I try to join by Key= Answer, for each key there are many matching answers, and if do it by c("Key" = "Answer", "date = "date), the dates are not exactly matching, that is why I need to do which.min, to find the closest date to link them up, not the same date.Jesse Kerr
You're ignoring everything in my comment after the join. You join by Key = Answer, which does result in many matches. The rest of my comment tells you how to find the closest date and get rid of all the other rows: arrange, group_by, slice. This does the same thing as your which.min, but in an easier workflow.Gregor Thomas

1 Answers

2
votes

This should work:

library(dplyr)
df1 %>% 
  left_join(df2, by = c("Key" = "Answer")) %>% 
  mutate(date_diff = abs(difftime(date.x, date.y, units = "secs"))) %>% 
  group_by(Key) %>% 
  arrange(date_diff) %>% 
  slice(1) %>% 
  ungroup()

We are first joining the two data frames with left_join. Yes, I'm aware there are possibly multiple dates for each Key, bear with me.

Next, we calculate (with mutate) the absolute value (abs) of the difference between the two dates date.x and date.y.

Now that we have this, we will group the data by Key using group_by. This will make sure that each distinct Key will be treated separately in subsequent calculations.

Since we've calculated the date_diff, we can now re-order (arrange) the data for each Key, with the smallest date_diff as first for each Key.

Finally, we are only interested in that first, smallest date_diff for each Key, so we can discard the rest using slice(1).

This pipeline gives us the following:

  Key       i date.x         j date.y     date_diff
  <chr> <int> <date>     <int> <date>     <time>   
1 a        11 2011-01-01    14 2015-03-04 131587200
2 b        12 2011-01-02    17 2015-03-07 131760000
3 c        13 2011-01-03    19 2015-03-09 131846400