Combining vectors of unequal length and non-unique values

Question

I would like to do the following:

combine into a data frame, two vectors that

have different length
contain sequences found also in the other vector
contain sequences not found in the other vector
sequences that are not found in other vector are never longer than 3 elements
always have same first element

The data frame should show the equal sequences in the two vectors aligned, with NA in the column if a vector lacks a sequence present in the other vector.

For example:

vector 1    vector 2                     vector 1        vector 2
   1           1                            a               a
   2           2                            g               g
   3           3                            b               b
   4           1            or              h               a
   1           2                            a               g
   2           3                            g               b   
   5           4                            c               h
               5                                            c

should be combined into data frame

    1   1                                    a   a
    2   2                                    g   g
    3   3                                    b   b
    4   NA                                   h   NA
    1   1                  or                a   a 
    2   2                                    g   g
    NA  3                                    NA  b
    NA  4                                    NA  h
    5   5                                    c   c

What I did, is to search for merge, combine, cbind, plyr examples but was not able to find solutions. I am afraid I will need to start write a function with nested for loops to solve this problem.

A sequence by definition is an ordered list of objects. Take two sequences, e.g. c("apple", "banana") and c("apple", "orange"): we know from sequence #1 that banana will come after apple; similarly, we know from sequence #2 that orange will come after apple. But what will tell us if banana or orange should come first? We need a way to sort elements across multiple sequences. Can you clarify that aspect of the problem? (In your example, you used integers for which there is an implicit solution. Maybe it's just a matter of confirming that your two vectors are integer.) — flodel
Do both vectors contain the same number of sequences? Is there a special value that always indicate the start of a new sequence (1 in your example)? — flodel

flodel flodel · Accepted Answer · 2012-12-16T04:15:21

I maintain that your problem might be solved in terms of the shortest common supersequence. It assumes that your two vectors each represent one sequence. Please give the code below a try.

If it still does not solve your problem, you'll have to explain exactly what you mean by "my vector contains not one but many sequences": define what you mean by a sequence and tell us how sequences can be identified by scanning through your two vectors.

Part I: given two sequences, find the longest common subsequence

LongestCommonSubsequence <- function(X, Y) {
    m <- length(X)
    n <- length(Y)
    C <- matrix(0, 1 + m, 1 + n)
    for (i in seq_len(m)) {
        for (j in seq_len(n)) {
            if (X[i] == Y[j]) {
                C[i + 1, j + 1] = C[i, j] + 1
            } else {
                C[i + 1, j + 1] = max(C[i + 1, j], C[i, j + 1])
            }
        }
    }

    backtrack <- function(C, X, Y, i, j) {
        if (i == 1 | j == 1) {
            return(data.frame(I = c(), J = c(), LCS = c()))
        } else if (X[i - 1] == Y[j - 1]) {
            return(rbind(backtrack(C, X, Y, i - 1, j - 1),
                         data.frame(LCS = X[i - 1], I = i - 1, J = j - 1)))
        } else if (C[i, j - 1] > C[i - 1, j]) {
            return(backtrack(C, X, Y, i, j - 1))
        } else {
            return(backtrack(C, X, Y, i - 1, j))
        }
    }

    return(backtrack(C, X, Y, m + 1, n + 1))
}

Part II: given two sequences, find the shortest common supersequence

ShortestCommonSupersequence <- function(X, Y) {
    LCS <- LongestCommonSubsequence(X, Y)[c("I", "J")]
    X.df <- data.frame(X = X, I = seq_along(X), stringsAsFactors = FALSE)
    Y.df <- data.frame(Y = Y, J = seq_along(Y), stringsAsFactors = FALSE)   
    ALL <- merge(LCS, X.df, by = "I", all = TRUE)
    ALL <- merge(ALL, Y.df, by = "J", all = TRUE)
    ALL <- ALL[order(pmax(ifelse(is.na(ALL$I), 0, ALL$I),
                          ifelse(is.na(ALL$J), 0, ALL$J))), ]
    ALL$SCS <- ifelse(is.na(ALL$X), ALL$Y, ALL$X)
    ALL
}

Your Example:

ShortestCommonSupersequence(X = c("a","g","b","h","a","g","c"),
                            Y = c("a","g","b","a","g","b","h","c"))
#    J  I    X    Y SCS
# 1  1  1    a    a   a
# 2  2  2    g    g   g
# 3  3  3    b    b   b
# 9 NA  4    h <NA>   h
# 4  4  5    a    a   a
# 5  5  6    g    g   g
# 6  6 NA <NA>    b   b
# 7  7 NA <NA>    h   h
# 8  8  7    c    c   c

(where the two updated vectors are in columns X and Y.)

Combining vectors of unequal length and non-unique values

2 Answers