0
votes

I have a two data frame, say 'df1' and 'df2'. df1 has the following column:

Date

and df2 has the following columns:

Date.1, USD.Price, Date.2, EUR.Price, Date.3, JPY.Price, Date.4, INR.Price

where Date, Date.1, Date.2, Date.3, Date.4 ... is in date format.

Now I want to merge Date.1, USD.Price with df1 based on df1$Date and df2$Date.2 as:

df3 = merge(df1, df2[,1:2],  by.x = "Date", by.y = "Date.1", all = TRUE)

Then,

df4 = merge(df3, df2[,3:4],  by.x = "Date", by.y = "Date.2", all = TRUE)

Then again,

df5 = merge(df4, df2[,5:6],  by.x = "Date", by.y = "Date.3", all = TRUE)

Furthermore,

df6 = merge(df5, df2[,7:8],  by.x = "Date", by.y = "Date.4", all = TRUE)

and so on for all 1000 such columns.

For example, lets say, I have a following dataframe:

df1:

Date
2009-10-13
2009-10-14
2009-10-16
2009-10-18
2009-10-19
2009-10-20
2009-10-21
2009-10-22

and df2:

 Date.1      USD.Price   Date.2       EUR.Price     Date.3       JPY.Price      Date.4           INR.Price     
 2009-10-13  21.6        NA           NA            NA            NA         NA                   NA 
 2009-10-14  21.9        2009-10-14   78.2          NA            NA         NA                   NA 
 2009-10-16  22.0        2009-10-16   78.5          NA             NA        2009-10-16           12.2
 NA          NA          2009-10-18   78.9          2009-10-18  32.1       2009-10-18             12.4
NA           NA           NA          NA            2009-10-19  32.6      2009-10-19             12.2  

Then the output needs to be:

Date           USD.Price    EUR.Price    JPY.Price    INR.Price
2009-10-13     21.6         NA           NA           NA
2009-10-14     21.9         78.2         NA           NA
2009-10-16     22.0         78.5         NA           NA
2009-10-18     NA           78.9         32.1         12.4
2009-10-19     NA           NA           32.6         12.2 

I have got some reference: How can I merge multiple dataframes with the same column names?

But in my case column names are different as Date.1, Date.2, Date.3 etc...

Can anyone please help me out how to do this for around 1000 columns aa doing as above is not scalable for many columns?

Thanks

5

5 Answers

0
votes

Maybe this loop could help you out:

for(n in 1:999){
  assign(paste('df',n+2,sep = ''),
         merge(get(paste('df',n,sep = '')), get(paste('df',n+1,sep = ''))[,n:n+1],  
               by.x = 'Date', by.y = paste('Date',n,sep = '.'), all = TRUE),
         envir = .GlobalEnv)
}
1
votes

You can try a recursive function (a function that calls itself).

It takes two data.frames and a column index. It merges the data.frames based on the first column of df1 and the first column of df2 that is subsetted using the idx. Then it calls itself using the new data.frame dfx and df2 while idx is less then the number of columns in df2 - 1.

merge_df <- function(df1, df2, idx) {

  dfx <- merge(df1, df2[, idx:(idx + 1)], by.x = names(df1)[1], 
               by.y = names(df2)[idx])

  if (idx < ncol(df2) - 1) {
    return(merge_df(dfx, df2, idx + 2))
  } else {
    return(dfx)
  }
}

You can use it like this:

df1 <- data.frame(id = 1:10)
df2 <- data.frame(id1 = 1:10,
                  test1 = letters[1:10],
                  id2 = 1:10,
                  test2 = LETTERS[1:10])


df <- merge_df(df1, df2, 1)

This would result in this:

head(df, 10)
   id test1 test2
1   1     a     A
2   2     b     B
3   3     c     C
4   4     d     D
5   5     e     E
6   6     f     F
7   7     g     G
8   8     h     H
9   9     i     I
10 10     j     J
1
votes

You could do this...

datecols <- grep("Date", names(df)) #get date columns

dfDates <- apply(df[,datecols], 1, function(x) x[!is.na(x)][1]) #vector of dates

df2 <- cbind(Date=dfDates, df[,-datecols]) #bind dates to non-date columns

df2
        Date USD.Price EUR.Price JPY.Price INR.Price
1 2009-10-13      21.6        NA        NA        NA
2 2009-10-14      21.9      78.2        NA        NA
3 2009-10-16      22.0      78.5        NA      12.2
4 2009-10-18        NA      78.9      32.1      12.4
5 2009-10-19        NA        NA      32.6      12.2
0
votes

Here's a tidyverse way using your example df1 and df2 with the date columns processed with lubridate:

library(tidyr)
library(dplyr)
library(lubridate)

# reformat df2
df2bis <- 
  df2 %>%
  gather(key = "tmp_key",
         value = "Date",
         starts_with("Date"),
         na.rm = TRUE) %>%
  select(-tmp_key) %>%
  distinct()

 # and merge with df1
 df <- inner_join(df1, df2bis)
0
votes

An efficient way of doing this using sqldf I think.

# Changing column names in df2 for convenience
names(df2) <- c("Date1", "USD_Price", "Date2", "EUR_Price", "Date3", "JPY_Price", "Date4", "INR_Price")

library(sqldf) 
sqldf({"
    SELECT D1.Date, D2.USD_Price, D2.EUR_Price, D2.JPY_Price, D2.INR_Price FROM df1 AS D1
    INNER JOIN df2 AS D2
    ON D1.Date IN (D2.Date1, D2.Date2, D2.Date3, D2.Date4)
"})

#        Date USD_Price EUR_Price JPY_Price INR_Price
#1 2009-10-13      21.6        NA        NA        NA
#2 2009-10-14      21.9      78.2        NA        NA
#3 2009-10-16      22.0      78.5        NA      12.2
#4 2009-10-18        NA      78.9      32.1      12.4
#5 2009-10-19        NA        NA      32.6      12.2