Suppose I have two data frames, df1
and df2
. Both data frames have an identifier id
. My goal is to merge those data sets on this identifier, but I want to anonimize the names in the id
column. However, problem is that I want to so for both data sets individually, thus for df1
and df2
and not directly on df3
(because that would be easy: just replace the id
column with some random characters)
I think my solution would need to look something like this. First, I make a separate dataframe consisting of all unique id
s from both df1
and df2
. Then, I would need to assign some randomization, for example, idxxxx
where xxxx
is an unique number for each id
in this separate data frame. With a dplyr
, gsub
, or stringr
approach I can replace the id
s from df1
and df2
according to the value assigned in this separate data frame. After this, I will merge the two data sets.
Here I have two example data frames, my try to solve the problem, and the desired result. Note that the the number of ids does not really matter to me (e.g., it does not matter if John Terry has id0004 or id0003, as long as it is consistently changed within both data frames.
Can someone help me out with this? Thanks!
id <- c("John Williams", "John Terry", "Rick Fire", "Katie Blue", "Unknown")
row1 <- c("28", "17", "17", "29", "39")
df1 <- data.frame(id,row1)
id <- c("Frank Johnson", "John Terry", "Rick Fire", "Katie Blue")
row2 <- c("Purple", "Red", "Yellow", "Green")
df2 <- data.frame(id,row2)
df3 <- merge(df1, df2, all.x = TRUE, all.y = TRUE)
#My try
#Make separate data frame
id_df <- merge(df1, df2, all.x = TRUE, all.y = TRUE)
id_df <- subset(id_df,TRUE,select = c(id))
id_df$anonymous <- id_df %>% mutate(id = row_number()) #it would be nicer to have something like id0001
#Replace ids within df1 and df2 according to the id_df anonymous variable
library(stringr)
df1$id <- str_replace(df1$id, id_df$id, as.character(id_df$anonymous)) #replacement does not work
#desired result
#df1 row1
#id0003 28
#id0002 17
#id0005 17
#id0004 29
#id0006 39
#df2 row2
#id0001 Purple
#id0002 Red
#id0005 Yellow
#id0004 Green
#df3
#id #row1 #row2
#id0001 NA Purple
#id0002 17 Red
#id0003 28 NA
#id0004 29 Green
#id0005 17 Yellow
#id0006 39 NA