0
votes

I have two data.frames with thousands of rows and dozens of columns each, both created by merging several csv files. The data.frames are exactly what I wanted. I shall also add that that df1 and df2 have several columns in common. Only problem is that in one of them, say df1 there are, for some columns, some NAs (this is expected/normal). The good thing is that the same columns in which I have NAs also occur in the second data.frame, say df2, but with no NAs. What I would like to do is to fill the NAs in a given column of df2 with values from the same column of df1 where each value is selected based on the matching of other columns value between df1 and df2.

Using some random data:

A<- format(seq(as.Date("2021/09/01"), by = "day", length.out = 26), format="%Y%m%d")
B<- format(seq(as.POSIXct("2021-1-1 15:00"), as.POSIXct("2021-1-02 16:00"), by = "hour"), format = "%H:%M")
C<- sample(1:100, 26, replace=FALSE)
D<- LETTERS
E<- paste(D, C)

df1<- cbind(A, B, C, D)
df2<- cbind(A, B, C, E)
df2[c(7:10, 13, 18:21), 3] <- NA #replace some of the values with NAs
df2<- df2[-c(2,5,11,17,23,26),] #delete some columns so df1 and df2 are of different size

I would like that for df2$C, when a NA value is found, both the corresponding df2$A and df2$B values are matched against all df1$A and df2$B. When, and only if, a row in df1 is found with matching df1$A and df1$B, the corresponding value of df1$C is copied in df2$C NA value. In other words, for the example data above, if there is a NA value in df2$C, the corresponding combination of date and time for that specific row should be matched against all the resulting concatenation of date and time from df1 and when a match is found, the corresponding value of df1$C is used to replace the NA in df2$C.

# as an instance, a case of my df2 where a NA occurs is
df2[17,3] 

#This should be replaced with the value from
df1[21,3] 

# because the time and date of 
df2[17,] 

A          B          C          E 
"20210921"    "11:00"         NA     "U 46" 

#is the same than 
df1[21,] 

A          B          C          D 
"20210921"    "11:00"       "46"        "U" 

I am 100% sure that a concatenation of column A and B in each of the data.frame gives unique results in each data.frame so that a concatenation of df1$A and df1$B will be unique and the same applies to df2. Also, with certainty, there will be one and only value resulting from the concatenation of df2$A and df2$B matching the concatenation of column A and B in df1. For this reason I have tried concatenating for each of the data.frames the columns A and B into a new column to find matching values using match(paste(...,...)) within a elseif loop in Base R and alternatively other solutions by using the dplyr library proposed in other fora, but I can't find a solution that works for me. I think this is similar to an Excel lookup/vlookup function with multiple criteria but I can't figure this one out in R.

Any suggestion on how to proceed? Thanks.

2
First thing, you are missing code for 'A'. Second thing, it would be a lot easier to help you if you posted your desired output as code. In other words, show us what your desired output should look like.xilliam
I have edited the question and the code.Nino Pierantonio

2 Answers

0
votes

Since you mention dplyr already; it provides the tools for the job. Simply left_join on the columns you wish to match, then coalesce the results to replace missing values of 'C' with matching values from the second table.

library(dplyr)
df1 <- as.data.frame(df1)
df2 <- as.data.frame(df2)

concatenate <- df2 %>%
    left_join(df1, by = c('A', 'B'), suffix = c('', '.y')) %>%
    mutate(C = coalesce(C, C.y)) %>%
    select(-c(C.y))
0
votes

You can create a unique key to update df2.

unique_key1 <- paste(df1$A, df1$B)
unique_key2 <- paste(df2$A, df2$B)
inds <- is.na(df2$C)
df2$C[inds] <- df1$C[match(unique_key2[inds], unique_key1)]
df2

#         A     B  C    E
#1  20210901 15:00 74 A 74
#2  20210903 17:00 27 C 27
#3  20210904 18:00 60 D 60
#4  20210906 20:00  7  F 7
#5  20210907 21:00 96 G 96
#6  20210908 22:00 98 H 98
#7  20210909 23:00 38 I 38
#8  20210910 00:00 89 J 89
#9  20210912 02:00 69 L 69
#10 20210913 03:00 72 M 72
#11 20210914 04:00 76 N 76
#12 20210915 05:00 63 O 63
#13 20210916 06:00 13 P 13
#14 20210918 08:00 25 R 25
#15 20210919 09:00 92 S 92
#16 20210920 10:00 21 T 21
#17 20210921 11:00 79 U 79
#18 20210922 12:00 41 V 41
#19 20210924 14:00 97 X 97
#20 20210925 15:00 16 Y 16

data

cbind creates a matrix, use data.frame to create dataframes.

df1 <- data.frame(A, B, C, D)
df2 <- data.frame(A, B, C, E)