0
votes

I have two data frames,

df1

 Identifier    GSE1028888  GSE1034555
     100002           0.1         0.2
     100003           0.3         0.4
     ......         .....       .....
     100007           0.9         1.1

df2

         V3            V2
     100002         XLX12         
     100003         ABorF        
     ......         .....       
     110000         GEF22

Now I want insert the V2 information into df1, such as

df3

 Identifier        New_V2    GSE1028888  GSE1034555
     100002         XLX12           0.1         0.2
     100003         ABorF           0.3         0.4
     100004            NA           0.6         0.7
     ......         .....       .....
     100007         ccL34           0.9         1.1           

The V3 of df2 and Identifier of df1 have different length. I try dplyr left_join, but the column is attached at the end.

This is the code to create similar data frame

df1 <- data.frame("Identifier" = sample(100001:100010, 6, replace = F), 
               y = sample(rnorm(10), 6, replace = T), 
               z = sample(rnorm(10), 6, replace = T))
df2 <- data.frame(V1 = c(100001:100010), 
              V2 = sample(state.name, 10, replace = T))

This time when I try dplyr left_join,

left_join(df1, df2, by =c("Identifier"="V3"))

an error message was shown

Error: cannot join on columns 'V3' x 'Identifier': index out of bounds

Anyone has any idea?

2
Your question does not contain a reproducible example. It is therefore hard to understand your problem and give you an appropriate answer. Please make your data available (e.g. by using dput()) or use one of the example data sets in R. Also, add the minimal code required to reproduce your problem to your post.Stibu

2 Answers

1
votes

Using the data you provided:

df1 <- data.frame("Identifier" = sample(100001:100010, 6, replace = F), 
               y = sample(rnorm(10), 6, replace = T), 
               z = sample(rnorm(10), 6, replace = T))
df2 <- data.frame(V1 = c(100001:100010), 
              V2 = sample(state.name, 10, replace = T))

merge(df2, df1, by.x="V1",by.y="Identifier", all.x=TRUE)
       V1            V2          y          z
1  100001     Wisconsin -1.9468637  0.4509951
2  100002      Nebraska  2.5155819  0.4509951
3  100003          Ohio         NA         NA
4  100004 Massachusetts         NA         NA
5  100005       Montana -1.9468637 -2.1825878
6  100006      Illinois -0.1591367  0.3445637
7  100007    New Mexico -0.5696300  0.4509951
8  100008    New Jersey         NA         NA
9  100009     Tennessee         NA         NA
10 100010    Washington -1.9468637 -0.5402241

And you can toggle the order which ever way you like but the above output is what you requested -no need to reorder but if wanted to:

merge(df2, df1, by.x="V1",by.y="Identifier", all.x=TRUE)[c(2,3,4,1)]
              V2          y          z     V1
1      Wisconsin -1.9468637  0.4509951 100001
2       Nebraska  2.5155819  0.4509951 100002
3           Ohio         NA         NA 100003
4  Massachusetts         NA         NA 100004
5        Montana -1.9468637 -2.1825878 100005
6       Illinois -0.1591367  0.3445637 100006
7     New Mexico -0.5696300  0.4509951 100007
8     New Jersey         NA         NA 100008
9      Tennessee         NA         NA 100009
10    Washington -1.9468637 -0.5402241 100010

As per the OP's posted comments below, here is an example where an identifier exists in df1 that does not exist in df2

df1 <- data.frame("Identifier" = sample(100001:100012, 6, replace = F), 
               y = sample(rnorm(10), 6, replace = T), 
               z = sample(rnorm(10), 6, replace = T))

df1
  Identifier           y            z
1     100011 -1.60532712  1.365836073
2     100007 -1.28821500  0.005925986
3     100004 -0.03444609  0.780708952
4     100006  0.32190045  0.780708952
5     100009 -1.60532712 -1.471916384
6     100005 -0.76985033  0.191956916

df2 <- data.frame(V1 = c(100001:100010), 
              V2 = sample(state.name, 10, replace = T))
df2
       V1            V2
1  100001  Pennsylvania
2  100002 West Virginia
3  100003          Utah
4  100004        Alaska
5  100005          Ohio
6  100006   Mississippi
7  100007 New Hampshire
8  100008    New Jersey
9  100009          Ohio
10 100010       Georgia

merge(df2, df1, by.x="V1",by.y="Identifier", all.x=TRUE, all.y=TRUE)
       V1            V2           y            z
1  100001  Pennsylvania          NA           NA
2  100002 West Virginia          NA           NA
3  100003          Utah          NA           NA
4  100004        Alaska -0.03444609  0.780708952
5  100005          Ohio -0.76985033  0.191956916
6  100006   Mississippi  0.32190045  0.780708952
7  100007 New Hampshire -1.28821500  0.005925986
8  100008    New Jersey          NA           NA
9  100009          Ohio -1.60532712 -1.471916384
10 100010       Georgia          NA           NA
11 100011          <NA> -1.60532712  1.365836073
4
votes

One solution is to merge and reorder them (unfortunately not a one-function answer).

Setup

df1 = data.frame(i=c(1,2,3), GSE111=c(4,5,6), GSE222=c(7,8,9))
df2 = data.frame(i=c(1,3,4), v2=c(10,11,12))

This produces:

> df1
  i GSE111 GSE222
1 1      4      7
2 2      5      8
3 3      6      9
4 4      7     10

> df2
  i v2
1 1 10
2 3 11
3 4 12

Merge:

I've used built-in merge function instead of plyr.

df3 = merge(x=df1, y=df2, by="i", all.x=TRUE)

this will yield:

  i GSE111 GSE222 v2
1 1      4      7 10
2 2      5      8 NA
3 3      6      9 11

Reorder:

Now you can reorder by literally typing column names:

df4 = df3[,c("i", "v2", "GSE111", "GSE222")]

which yields

  i v2 GSE111 GSE222
1 1 10      4      7
2 2 NA      5      8
3 3 11      6      9

Which is essentially the product you wanted.



Shortcut:

Or, if you are lazy like me, I can generate unique column names between df1 and df2 like this:

unique(c(colnames(df2), colnames(df1)))

This yields:

[1] "i"      "v2"     "GSE111" "GSE222"

So you can type something like:

df3[,unique(c(colnames(df2), colnames(df1)))]