In R: How to consistently replace (/ anonimize) ids or names within two separate data frames

Question

Suppose I have two data frames, df1 and df2. Both data frames have an identifier id. My goal is to merge those data sets on this identifier, but I want to anonimize the names in the id column. However, problem is that I want to so for both data sets individually, thus for df1 and df2 and not directly on df3 (because that would be easy: just replace the id column with some random characters)

I think my solution would need to look something like this. First, I make a separate dataframe consisting of all unique ids from both df1 and df2. Then, I would need to assign some randomization, for example, idxxxx where xxxx is an unique number for each id in this separate data frame. With a dplyr, gsub, or stringr approach I can replace the ids from df1 and df2 according to the value assigned in this separate data frame. After this, I will merge the two data sets.

Here I have two example data frames, my try to solve the problem, and the desired result. Note that the the number of ids does not really matter to me (e.g., it does not matter if John Terry has id0004 or id0003, as long as it is consistently changed within both data frames.

Can someone help me out with this? Thanks!

id <- c("John Williams", "John Terry", "Rick Fire", "Katie Blue", "Unknown")
row1 <- c("28", "17", "17", "29", "39")
df1 <- data.frame(id,row1)

id <- c("Frank Johnson", "John Terry", "Rick Fire", "Katie Blue")
row2 <- c("Purple", "Red", "Yellow", "Green")
df2 <- data.frame(id,row2)

df3 <- merge(df1, df2, all.x = TRUE, all.y = TRUE)

#My try
#Make separate data frame
id_df <- merge(df1, df2, all.x = TRUE, all.y = TRUE)
id_df <- subset(id_df,TRUE,select = c(id))
id_df$anonymous <- id_df %>% mutate(id = row_number()) #it would be nicer to have something like id0001

#Replace ids within df1 and df2 according to the id_df anonymous variable
library(stringr)
df1$id <- str_replace(df1$id, id_df$id, as.character(id_df$anonymous)) #replacement does not work

#desired result
#df1        row1
#id0003     28 
#id0002     17
#id0005     17
#id0004     29
#id0006     39

#df2        row2
#id0001     Purple
#id0002     Red
#id0005     Yellow
#id0004     Green

#df3
#id         #row1       #row2
#id0001     NA          Purple
#id0002     17          Red
#id0003     28          NA
#id0004     29          Green
#id0005     17          Yellow
#id0006     39          NA

Allan Cameron Allan Cameron · Accepted Answer · 2020-12-01T13:28:42

If you want to anonymize the ids and make it reasonably difficult to reverse them, you could calculate the md5 hash of each string. Two identical strings will produce the same md5 hash:

df1$id <- sapply(df1$id, digest::digest, algo = "md5")
df2$id <- sapply(df2$id, digest::digest, algo = "md5")
df3 <- merge(df1, df2, all.x = TRUE, all.y = TRUE)

df3
#>                                 id row1   row2
#> 1 22e35044ed452870ad5b014e87121d9d   39   <NA>
#> 2 69d61b42a2f549c4765699f06de3b351   28   <NA>
#> 3 ad1cc76e26c5d73ba4a03bf51df1b6af   17 Yellow
#> 4 b3bdcc4913da319308e6ddf47e09da12 <NA> Purple
#> 5 badea53ae1e8a2fa66ebd1cdde9dd413   17    Red
#> 6 d1f305c19a2f9649abe11efcf26ac645   29  Green

In R: How to consistently replace (/ anonimize) ids or names within two separate data frames

3 Answers