Edit string value based on value in another column using R

Question

I have data on women who married and sometimes changed surnames over the period 1990-1999. However, I do not always know the exact year the name change took place, only that the surname changed sometime between year x and year y. In the original records, the old surname was simply crossed out and the new surname written next to it, so "last_name" holds both names and the crossed-out one is recorded in the column "crossed_over". For example, Sarah Smith changed her name to Sarah Draper sometime in the period 1994-1999.

What I would like is for each woman to have exactly one surname per year, as Liza Moore does (she became Liza Neville in 1997). Preferably the change would be assigned to the midpoint of the uncertain period, identified using the column "crossed_over". For example, Sarah Smith would become Sarah Draper in 1997 and Mary King would become Mary Fisher in 1997 or 1998.

Does anyone have a suggestion for how I can achieve this using the example below?

library(tidyverse)

id <- rep(1:4, each = 10)
year <- rep(1990:1999, 4)
first_name <- c(rep("molly", 10), rep("sarah", 10), rep("mary", 10), rep("liza", 10))
last_name <- c(rep("johnson", 10), rep("smith", 4), rep("smith draper", 6), rep("king", 5), rep("king fisher", 5), 
               rep("moore", 7), rep("neville", 3))
crossed_over <- c(rep(NA, 10), rep(NA, 4), rep("smith", 6), rep(NA, 5), rep("king", 5), rep(NA, 10))

df <- tibble(id, year, first_name, last_name, crossed_over)

Ben Ben · Accepted Answer · 2020-12-29T13:50:34

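Here is one approach. Take only the rows that have a crossed_over name and group them by woman. Within each group, set new_last_name to the crossed_over (old) name for the first half of the rows, and to last_name with the crossed_over name removed (the new name) for the second half. Then join back to the full data and fall back to last_name wherever no name was crossed out.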
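The halving rule can be checked on its own before reading the full pipeline. This is only a sketch in base R: first_half() is a hypothetical helper written for this illustration, not part of the answer's code, and it mirrors the condition row_number() <= n()/2.

```r
# Hypothetical helper: TRUE marks the rows that keep the old (crossed-out) surname.
first_half <- function(n) seq_len(n) <= n / 2

# Sarah has 6 uncertain years (1994-1999): the first 3 keep "smith".
sarah <- first_half(6)

# Mary has 5 uncertain years (1995-1999): n / 2 is 2.5, so only the
# first 2 keep "king" and the switch lands on 1997.
mary <- first_half(5)

# Assumed variant: rounding up moves Mary's switch to 1998 instead.
mary_late <- seq_len(5) <= ceiling(5 / 2)

print(sarah)
print(mary)
print(mary_late)
```

With an odd number of years the plain rule assigns the extra year to the new surname; the ceiling() variant assigns it to the old one.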

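library(tidyverse)  # loads dplyr and stringr

df %>%
  # Keep only the rows where a surname was crossed out
  filter(!is.na(crossed_over)) %>%
  # One group per woman and name combination (every column except year)
  group_by(across(c(-year))) %>%
  # Assumes rows are already sorted by year within each group:
  # first half keeps the old name, second half gets the new name
  mutate(new_last_name = ifelse(row_number() <= n()/2,
                                crossed_over,
                                str_trim(str_remove(last_name, crossed_over)))) %>%
  ungroup() %>%
  # Bring back the rows that were filtered out; join on all original columns
  right_join(df, by = names(df)) %>%
  # Rows without a crossed-out name simply keep last_name
  mutate(new_last_name = coalesce(new_last_name, last_name)) %>%
  arrange(id, year)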

Output

      id  year first_name last_name    crossed_over new_last_name
   <int> <int> <chr>      <chr>        <chr>        <chr>        
 1     1  1990 molly      johnson      NA           johnson      
 2     1  1991 molly      johnson      NA           johnson      
 3     1  1992 molly      johnson      NA           johnson      
 4     1  1993 molly      johnson      NA           johnson      
 5     1  1994 molly      johnson      NA           johnson      
 6     1  1995 molly      johnson      NA           johnson      
 7     1  1996 molly      johnson      NA           johnson      
 8     1  1997 molly      johnson      NA           johnson      
 9     1  1998 molly      johnson      NA           johnson      
10     1  1999 molly      johnson      NA           johnson      
11     2  1990 sarah      smith        NA           smith        
12     2  1991 sarah      smith        NA           smith        
13     2  1992 sarah      smith        NA           smith        
14     2  1993 sarah      smith        NA           smith        
15     2  1994 sarah      smith draper smith        smith        
16     2  1995 sarah      smith draper smith        smith        
17     2  1996 sarah      smith draper smith        smith        
18     2  1997 sarah      smith draper smith        draper       
19     2  1998 sarah      smith draper smith        draper       
20     2  1999 sarah      smith draper smith        draper       
21     3  1990 mary       king         NA           king         
22     3  1991 mary       king         NA           king         
23     3  1992 mary       king         NA           king         
24     3  1993 mary       king         NA           king         
25     3  1994 mary       king         NA           king         
26     3  1995 mary       king fisher  king         king         
27     3  1996 mary       king fisher  king         king         
28     3  1997 mary       king fisher  king         fisher       
29     3  1998 mary       king fisher  king         fisher       
30     3  1999 mary       king fisher  king         fisher       
31     4  1990 liza       moore        NA           moore        
32     4  1991 liza       moore        NA           moore        
33     4  1992 liza       moore        NA           moore        
34     4  1993 liza       moore        NA           moore        
35     4  1994 liza       moore        NA           moore        
36     4  1995 liza       moore        NA           moore        
37     4  1996 liza       moore        NA           moore        
38     4  1997 liza       neville      NA           neville      
39     4  1998 liza       neville      NA           neville      
40     4  1999 liza       neville      NA           neville
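As a possible variation, the same result can be reached without the filter and join by comparing each year with the midpoint of the uncertain period. This is a sketch under assumptions, not the answer's method: the names result and midpoint are introduced here for illustration, and it assumes each woman has at most one crossed-out name covering a contiguous run of years.

```r
library(dplyr)
library(stringr)

# Rebuild the question's example data so the sketch runs on its own.
df <- tibble(
  id = rep(1:4, each = 10),
  year = rep(1990:1999, 4),
  first_name = rep(c("molly", "sarah", "mary", "liza"), each = 10),
  last_name = c(rep("johnson", 10),
                rep("smith", 4), rep("smith draper", 6),
                rep("king", 5), rep("king fisher", 5),
                rep("moore", 7), rep("neville", 3)),
  crossed_over = c(rep(NA, 14), rep("smith", 6),
                   rep(NA, 5), rep("king", 5), rep(NA, 10))
)

result <- df %>%
  # Rows with the same id and crossed-out name form one uncertain period
  group_by(id, crossed_over) %>%
  mutate(
    # Midpoint of the period, e.g. 1996.5 for 1994-1999 (illustrative column)
    midpoint = mean(range(year)),
    new_last_name = case_when(
      is.na(crossed_over) ~ last_name,     # nothing crossed out: keep as is
      year < midpoint     ~ crossed_over,  # before the midpoint: old name
      TRUE ~ str_trim(str_remove(last_name, crossed_over))  # after: new name
    )
  ) %>%
  ungroup() %>%
  select(-midpoint)

print(result, n = 40)
```

Because this version compares years rather than row positions, it does not depend on the rows being sorted. Changing year < midpoint to year <= midpoint would move Mary's switch from 1997 to 1998.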
