Data frame of repeated values

Question

I'm currently working with R. I have a data frame with three columns named year1, year2 and year3. Each column holds a set of numeric data.

I want to have a resulting data frame which includes the data that is repeated in two different columns, that is: if num.4 is repeated in year1 and year2 the new data frame has num.4, in the same way, if num.5 is repeated in year2 and year3 the new data frame has num.5 included.

I tried the following code:

newdf1 <- origdf[origdf$year1 == origdf$year2 | origdf$year1 == origdf$year3, c(1)]

newdf2 <- origdf[origdf$year2 == origdf$year3, c(2)]

and then I merged both data frames, but not all the data was included and the result contained many NA values.

Then I tried the following code:

newdf <- origdf[origdf$year1 == origdf$year2 | origdf$year1 == origdf$year3 & origdf$year2 == origdf$year3, c(1, 2)]

But that didn't work either: it gave me a data frame with many NA values and some correct values, but not all of the repeated numbers were included.

How can I effectively have a data frame that includes values that are repeated in exactly two of the three different columns of the original data frame, without repeated values (I don't want to have a number that is repeated in all the three columns of the original data frame)?

The expected outcome would be:

>newdf

1 num.4
2 num.5

Also do not show images as no one else can use the data except by retyping it all. Show output of dput(x) where x is the input. — G. Grothendieck
I already edited the question, is it more precise? Thanks for your tips! — user44212

Ekatef Ekatef · Accepted Answer · 2018-02-26T11:32:51

If I understand correctly, you are looking for the intersections between the columns of your data frame, excluding the elements that are common to all three columns. In that case the intersect() function may be a solution. The code may look like this:

n_years <- 3
# generate all possible combinations of two column indices
indices_comb <- combn(x = seq_len(n_years), m = 2)
# apply intersect() to every pair of columns; lapply() always returns a list,
# whereas sapply() may simplify equal-length results to a vector or matrix
all_intersects <- lapply(seq_len(ncol(indices_comb)), function(i)
    intersect(origdf[, indices_comb[1, i]], origdf[, indices_comb[2, i]]))

Finally, exclude the elements that are common to all of the original columns (year1, year2, year3):

# find elements which are common for all pairwise intersections
in_all <- Reduce(intersect, all_intersects)
# combine all pairwise intersections into one vector
in_pairw <- Reduce(all_intersects, f = c)
# exclude the elements which are common for all intersections
newdf <- data.frame(res = setdiff(in_pairw, in_all))

The above solution scales easily to an arbitrary number of original columns (years); with more than three columns it returns the values shared by at least two columns but not by all of them. Note, however, that only unique values are returned: if num.4 appears twice in both year1 and year2, only one num.4 will be returned.
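
If values shared by exactly two columns are needed for any number of columns, one alternative is to count the columns in which each value occurs. The following is only a minimal sketch and not part of the approach above. The sample origdf is hypothetical, since the question does not show its data, and the sketch assumes values that survive a round trip through character (integers or short decimals), because table() stores them as names.

```r
# Hypothetical sample data; the question does not show the real origdf.
origdf <- data.frame(
  year1 = c(1, 2, 3, 4),
  year2 = c(4, 5, 6, 3),
  year3 = c(5, 7, 3, 8)
)

# Drop duplicates within each column so a value counts once per column,
# then pool all columns into a single vector.
pooled <- unlist(lapply(origdf, unique), use.names = FALSE)

# Count in how many columns each value appears.
n_cols <- table(pooled)

# Keep the values present in exactly two columns (3 is in all three here).
vals <- names(n_cols)[n_cols == 2]
newdf <- data.frame(res = as.numeric(vals))
```

With this sample data, 4 is shared by year1 and year2, 5 by year2 and year3, and 3 occurs in all three columns, so only 4 and 5 are kept.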
