0
votes

The iris dataset has 50 entries for each of the three species:

data('iris')
table(iris$Species)

setosa versicolor  virginica 
    50         50         50

With the iris dataset subsetted into two data frames (with overlapping species and asymmetric columns), and merged with an outer join:

# missing Petal.Width
SV <- subset(iris, Species == 'setosa' | Species == 'virginica',
            select = c('Sepal.Length', 'Sepal.Width', 'Petal.Length', 'Species'))
# missing Sepal.Length
VV <- subset(iris, Species == 'versicolor' | Species == 'virginica',
            select = c('Sepal.Width', 'Petal.Length', 'Petal.Width', 'Species'))
SV_VV_merge <- merge(SV,VV,all=TRUE)

I find 16 extra entries for virginica:

table(SV_VV_merge$Species)

setosa versicolor  virginica 
    50         50         66

How can I see which rows in the merged dataframe have duplicates for the shared columns 'Sepal.Width' 'Petal.Length' 'Species' for the species 'virginica'?

2

2 Answers

1
votes

We can use duplicated

duplicated(SV_VV_merge)
SV_VV_merge[duplicated(SV_VV_merge), ]

Results can be confirmed by counting the unique rows

nrow(unique(SV_VV_merge))
[1] 154

Please note that you are merging different column name subsets and might not be getting the expected results.

intersect(names(VV), names(SV))
[1] "Sepal.Width"  "Petal.Length" "Species"   
1
votes

Maybe not the most straight-forward, but add an indicator column to each dataframe, and concatenate.

SV <- subset(iris, Species == 'setosa' | Species == 'virginica',
        select = c('Sepal.Length', 'Sepal.Width', 'Petal.Length','Species'))
SV$sv_src <- "SV"


# missing Sepal.Length
VV <- subset(iris, Species == 'versicolor' | Species == 'virginica',
        select = c('Sepal.Width', 'Petal.Length', 'Petal.Width', 'Species'))
VV$vv_src <- "VV"


SV_VV_merge <- merge(SV,VV,all=TRUE)
SV_VV_merge$row_src <- apply(SV_VV_merge[c("sv_src", "vv_src")], 1, 
                      function(x) paste(na.omit(x), collapse = ""))

SV_VV_merge[, c("Sepal.Width", "Species", 'sv_src', 'vv_src', 'row_src')]

#    Sepal.Width    Species sv_src vv_src row_src
#1           2.0 versicolor   <NA>     VV      VV
#2           2.2 versicolor   <NA>     VV      VV
#3           2.2 versicolor   <NA>     VV      VV
#4           2.2  virginica     SV     VV    SVVV
#5           2.3     setosa     SV   <NA>      SV