0
votes

I am trying to merge two data frames on their ID columns (b_id in the first data frame, e_id in the second) using the following code.

data = merge(df1, df2, by.x = "b_id", by.y = "e_id", all = TRUE)

This works fine, but some rows (cases) have an ID and data in the second data frame and not in the first (and vice versa). For those rows the merge returns NA in the columns coming from the first data frame, including the ID (or vice versa).

I am wondering how I could return a merged data set in which the second data frame's ID numbers are appended to the first data frame's ID column. Programmes like SPSS or STATA do this automatically when you merge two data sets with differing completeness of data.

For example, I want to return this:

    b_id  dfv1  dfv2
    1101    5   NA
    1102    5   5
    1103    8   9
    1104    NA  3
    1105    NA  12

Not this!

    b_id  dfv1 dfv2
    1101    5   NA
    1102    5   5
    1103    8   9
    NA      NA  3
    NA      NA  12

From these two data frames:

    b_id  dfv1              
    1101    5               
    1102    5               
    1103    8               
    NA      NA              
    NA      NA              

    e_id    dfv2              
    NA      NA              
    1102    5               
    1103    9               
    1104    3               
    1105    12   
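
For reproducibility, the two data frames above could be built as follows. This is only a sketch: it assumes all columns are numeric and reuses the names df1 and df2 from the merge call.

```r
# Input data as shown in the two tables above; NA marks the missing cases.
df1 <- data.frame(b_id = c(1101, 1102, 1103, NA, NA),
                  dfv1 = c(5, 5, 8, NA, NA))
df2 <- data.frame(e_id = c(NA, 1102, 1103, 1104, 1105),
                  dfv2 = c(NA, 5, 9, 3, 12))
```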

Thanks

Can you give an example (using sample data) of how you want the merge to be done? (I don't know STATA/SPSS merge functions and I'm not sure I understand what you mean) - talat
What is the input (df1 and df2) for the desired output you included in your question? - talat
The input in the example in my question is df1, which hypothetically has two columns: one column of data (dfv1) and one column with a unique ID number (b_id). Data frame 2 has one column of data (dfv2) and one column with a unique ID number (e_id), which gets dropped when you merge. - Ash
It is very good that you add the desired output, but why don't you just also include the input for the small example? This would make answering much easier for everyone. - talat

2 Answers

1
votes

Since the input data is not provided, it is kind of hard to be sure about what is asked for. Based on what I understand from the question, the input could look like this:

df1 <- data.frame(b_id = c(1101, 1102, 1103), dfv1 = c(5,5,8))
df2 <- data.frame(e_id = c(1102, 1103,1104,1105), dfv2 = c(5,9,3,12))

> df1
  b_id dfv1
1 1101    5
2 1102    5
3 1103    8

> df2
  e_id dfv2
1 1102    5
2 1103    9
3 1104    3
4 1105   12

Now, if you run

merge(df1, df2, by.x = "b_id", by.y = "e_id", all = TRUE)

  b_id dfv1 dfv2
1 1101    5   NA
2 1102    5    5
3 1103    8    9
4 1104   NA    3
5 1105   NA   12

Does this answer the question? If not, please edit your question to include the input data.

Update

With the input data provided, it is now possible to answer your question. This seems to produce what you are looking for with the input data you provided:

merge(df1[complete.cases(df1),], df2[complete.cases(df2),], by.x = "b_id", by.y = "e_id", all = TRUE)

  b_id dfv1 dfv2
1 1101    5   NA
2 1102    5    5
3 1103    8    9
4 1104   NA    3
5 1105   NA   12

So basically you exclude all rows that are not complete in each data.frame and then merge the two (which creates some new NAs, as in your desired output).
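
One caveat: complete.cases() also drops rows that have a valid ID but a missing value in another column. Here is a minimal sketch of a variant that filters on the ID column only, assuming the input data from the question (the names df1_ids, df2_ids and merged are just for illustration):

```r
# Input data as given in the question.
df1 <- data.frame(b_id = c(1101, 1102, 1103, NA, NA),
                  dfv1 = c(5, 5, 8, NA, NA))
df2 <- data.frame(e_id = c(NA, 1102, 1103, 1104, 1105),
                  dfv2 = c(NA, 5, 9, 3, 12))

# Keep every row whose ID is present, whatever the other columns contain.
df1_ids <- df1[!is.na(df1$b_id), ]
df2_ids <- df2[!is.na(df2$e_id), ]

# Full outer join on the ID; the key column keeps the name from by.x.
merged <- merge(df1_ids, df2_ids, by.x = "b_id", by.y = "e_id", all = TRUE)
merged
```

With this input the result should be the same as with complete.cases(), because every row with an ID also has a value. Without any filtering, merge() by default matches the NA IDs of the two data frames against each other, which is where the all-NA rows come from.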

0
votes

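Try data = merge(df1, df2, all.x = TRUE, by = c("b_id", "e_id")). I did this a few days ago and it worked for me. Note that this assumes both b_id and e_id exist in both data frames, and that all.x = TRUE keeps only the IDs found in df1.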
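
In the question's data the ID column has a different name in each data frame, so a single by = needs a shared name first. Here is a minimal sketch of one way to do that, assuming the sample input from the first answer and renaming e_id to b_id on a copy (df2_renamed is a hypothetical name):

```r
# Sample input as assumed in the first answer (no missing IDs).
df1 <- data.frame(b_id = c(1101, 1102, 1103), dfv1 = c(5, 5, 8))
df2 <- data.frame(e_id = c(1102, 1103, 1104, 1105), dfv2 = c(5, 9, 3, 12))

# Give the ID column the same name in both data frames.
df2_renamed <- df2
names(df2_renamed)[names(df2_renamed) == "e_id"] <- "b_id"

# all = TRUE (rather than all.x = TRUE) keeps IDs that occur only in df2.
merged <- merge(df1, df2_renamed, by = "b_id", all = TRUE)
merged
```

all = TRUE is used here because all.x = TRUE alone would drop the IDs that appear only in the second data frame.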