R alternative to full_join, right_join, merge without increasing rows?

Question

Consider 2 dfs:

df1:

Col A	Col B	Col C
Tom	Ice Cream	0.2
Tom	Candy	0.4
Tom	Lollipop	0.6
Bob	Sweets	0.1
Abe	Sweets	0.1

df2:

Col A	Col D	Col E
Tom	Sweets	0.5
Tom	Chocolate	0.2
Tom	Cola	0.3

If I use full_join, right_join, or merge with by = "Col A" (with all.y=TRUE), I get each "permutation" as its own row:

df3:

Col A	Col B	Col C	Col D	Col E
Tom	Ice Cream	0.2	Sweets	0.5
Tom	Ice Cream	0.2	Chocolate	0.2
Tom	Ice Cream	0.2	Cola	0.3
Tom	Candy	0.4	Sweets	0.5
Tom	Candy	0.4	Chocolate	0.2
Tom	Candy	0.4	Cola	0.3
Tom	Lollipop	0.6	Sweets	0.5
Tom	Lollipop	0.6	Chocolate	0.2
Tom	Lollipop	0.6	Cola	0.3
Bob	Sweets	0.1	N/A	N/A
Abe	Sweets	0.1	N/A	N/A

If I use inner_join, left_join, or merge with by = "Col A" (without all.y=TRUE), I get the same as above except subsetted to only the rows manipulated, for example, rows with Abe and Bob would be removed:

df4:

Col A	Col B	Col C	Col D	Col E
Tom	Ice Cream	0.2	Sweets	0.5
Tom	Ice Cream	0.2	Chocolate	0.2
Tom	Ice Cream	0.2	Cola	0.3
Tom	Candy	0.4	Sweets	0.5
Tom	Candy	0.4	Chocolate	0.2
Tom	Candy	0.4	Cola	0.3
Tom	Lollipop	0.6	Sweets	0.5
Tom	Lollipop	0.6	Chocolate	0.2
Tom	Lollipop	0.6	Cola	0.3

I'm trying to do a merge without increasing the number of rows, because the unique permutations don't matter. It doesn't matter what order they get joined, either. Intended output:

df5:

Col A	Col B	Col C	Col D	Col E
Tom	Ice Cream	0.2	Sweets	0.5
Tom	Candy	0.4	Chocolate	0.2
Tom	Lollipop	0.6	Cola	0.3
Bob	Sweets	0.1	N/A	N/A
Abe	Sweets	0.1	N/A	N/A

Filtering df3 by unique doesn't help because the permutations are indeed unique.

In my scenario, I am 100% sure that every subsequent df to join to df1 has exactly 3 rows, which matches with the original number of rows with "Tom" present in df1$'Col A'.

In my real scenario, this join/merge would be deployed in a loop, where each entry in Col A has exactly 3 occurrences as well. So in later loops, it will do the same for "Abe" and "Bob".

bind_cols() isn't versatile enough because the columns may have already been created in a previous loop (i.e. Col D and Col E would be already created by Tom's loop, with N/As for Abe and Bob's rows). In this scenario, I only want the values to be imputed, not create new columns (which would further conflict by having the same colnames).

Is there a script/function/package to do what I'm intending? Converting to tibbles is fine as well. Maybe a some conditions to combine a _join/merge with a bind_cols?

nniloc nniloc · Accepted Answer · 2021-08-31T16:20:33

The problem is that *_join does not know which row to join with, so it joins to all three. Given your expected output, you would like the Tom rows to join in the same order that they appear in the dataframes. You can add a rowid column, which will let *_join know that you want to join Tom/1 to Tom/1.

library(tidyverse)

# data
df1 <- read.table(text = "
ColA    ColB    ColC
Tom IceCream    0.2
Tom Candy   0.4
Tom Lollipop    0.6
Bob Sweets  0.1
Abe Sweets  0.1", h = T)

df2 <- read.table(text = "
ColA    ColD    ColE
Tom Sweets  0.5
Tom Chocolate   0.2
Tom Cola    0.3", h = T)



# add row ids
df1 <- df1 %>%
  group_by(ColA) %>%
  mutate(rowid = row_number())

df2 <- df2 %>%
  mutate(rowid = row_number())


# join
left_join(df1,df2)
#> Joining, by = c("ColA", "rowid")
#> # A tibble: 5 x 6
#> # Groups:   ColA [3]
#>   ColA  ColB      ColC rowid ColD       ColE
#>   <chr> <chr>    <dbl> <int> <chr>     <dbl>
#> 1 Tom   IceCream   0.2     1 Sweets      0.5
#> 2 Tom   Candy      0.4     2 Chocolate   0.2
#> 3 Tom   Lollipop   0.6     3 Cola        0.3
#> 4 Bob   Sweets     0.1     1 <NA>       NA  
#> 5 Abe   Sweets     0.1     1 <NA>       NA

^{Created on 2021-08-31 by the reprex package (v2.0.0)}

An alternative to assigning rowids would be to subset df1 based on ColA, run cbind/bind_cols, then rejoin with df1.

df1 %>%
  filter(ColA == unique(df2$ColA)) %>%
  bind_cols(df2[,-1]) %>%
  right_join(df1)

#-----------
Joining, by = c("ColA", "ColB", "ColC")
  ColA     ColB ColC      ColD ColE
1  Tom IceCream  0.2    Sweets  0.5
2  Tom    Candy  0.4 Chocolate  0.2
3  Tom Lollipop  0.6      Cola  0.3
4  Bob   Sweets  0.1      <NA>   NA
5  Abe   Sweets  0.1      <NA>   NA

R alternative to full_join, right_join, merge without increasing rows?

1 Answers