4
votes

I would like to do the following:

combine into a data frame, two vectors that

  • have different length
  • contain sequences found also in the other vector
  • contain sequences not found in the other vector
  • sequences that are not found in other vector are never longer than 3 elements
  • always have same first element

The data frame should show the equal sequences in the two vectors aligned, with NA in the column if a vector lacks a sequence present in the other vector.

For example:

vector 1    vector 2                     vector 1        vector 2
   1           1                            a               a
   2           2                            g               g
   3           3                            b               b
   4           1            or              h               a
   1           2                            a               g
   2           3                            g               b   
   5           4                            c               h
               5                                            c

should be combined into data frame

    1   1                                    a   a
    2   2                                    g   g
    3   3                                    b   b
    4   NA                                   h   NA
    1   1                  or                a   a 
    2   2                                    g   g
    NA  3                                    NA  b
    NA  4                                    NA  h
    5   5                                    c   c

What I did, is to search for merge, combine, cbind, plyr examples but was not able to find solutions. I am afraid I will need to start write a function with nested for loops to solve this problem.

2
A sequence by definition is an ordered list of objects. Take two sequences, e.g. c("apple", "banana") and c("apple", "orange"): we know from sequence #1 that banana will come after apple; similarly, we know from sequence #2 that orange will come after apple. But what will tell us if banana or orange should come first? We need a way to sort elements across multiple sequences. Can you clarify that aspect of the problem? (In your example, you used integers for which there is an implicit solution. Maybe it's just a matter of confirming that your two vectors are integer.)flodel
Do both vectors contain the same number of sequences? Is there a special value that always indicate the start of a new sequence (1 in your example)?flodel

2 Answers

3
votes

I maintain that your problem might be solved in terms of the shortest common supersequence. It assumes that your two vectors each represent one sequence. Please give the code below a try.

If it still does not solve your problem, you'll have to explain exactly what you mean by "my vector contains not one but many sequences": define what you mean by a sequence and tell us how sequences can be identified by scanning through your two vectors.

Part I: given two sequences, find the longest common subsequence

LongestCommonSubsequence <- function(X, Y) {
    m <- length(X)
    n <- length(Y)
    C <- matrix(0, 1 + m, 1 + n)
    for (i in seq_len(m)) {
        for (j in seq_len(n)) {
            if (X[i] == Y[j]) {
                C[i + 1, j + 1] = C[i, j] + 1
            } else {
                C[i + 1, j + 1] = max(C[i + 1, j], C[i, j + 1])
            }
        }
    }

    backtrack <- function(C, X, Y, i, j) {
        if (i == 1 | j == 1) {
            return(data.frame(I = c(), J = c(), LCS = c()))
        } else if (X[i - 1] == Y[j - 1]) {
            return(rbind(backtrack(C, X, Y, i - 1, j - 1),
                         data.frame(LCS = X[i - 1], I = i - 1, J = j - 1)))
        } else if (C[i, j - 1] > C[i - 1, j]) {
            return(backtrack(C, X, Y, i, j - 1))
        } else {
            return(backtrack(C, X, Y, i - 1, j))
        }
    }

    return(backtrack(C, X, Y, m + 1, n + 1))
}

Part II: given two sequences, find the shortest common supersequence

ShortestCommonSupersequence <- function(X, Y) {
    LCS <- LongestCommonSubsequence(X, Y)[c("I", "J")]
    X.df <- data.frame(X = X, I = seq_along(X), stringsAsFactors = FALSE)
    Y.df <- data.frame(Y = Y, J = seq_along(Y), stringsAsFactors = FALSE)   
    ALL <- merge(LCS, X.df, by = "I", all = TRUE)
    ALL <- merge(ALL, Y.df, by = "J", all = TRUE)
    ALL <- ALL[order(pmax(ifelse(is.na(ALL$I), 0, ALL$I),
                          ifelse(is.na(ALL$J), 0, ALL$J))), ]
    ALL$SCS <- ifelse(is.na(ALL$X), ALL$Y, ALL$X)
    ALL
}

Your Example:

ShortestCommonSupersequence(X = c("a","g","b","h","a","g","c"),
                            Y = c("a","g","b","a","g","b","h","c"))
#    J  I    X    Y SCS
# 1  1  1    a    a   a
# 2  2  2    g    g   g
# 3  3  3    b    b   b
# 9 NA  4    h <NA>   h
# 4  4  5    a    a   a
# 5  5  6    g    g   g
# 6  6 NA <NA>    b   b
# 7  7 NA <NA>    h   h
# 8  8  7    c    c   c

(where the two updated vectors are in columns X and Y.)

6
votes

Note - this was proposed as an answer to the first version of the OP. The question has been modified since then but the problem is still not well-defined in my opinion.


Here is a solution that works with your integer example and would also work with numeric vectors. I am also assuming that:

  • both vectors contain the same number of sequences
  • a new sequence starts where value[i+1] <= value[i]

If your vectors are non-numeric or if one of my assumptions does not fit your problem, you'll have to clarify.

v1 <- c(1,2,3,4,1,2,5)
v2 <- c(1,2,3,1,2,3,4,5)

v1.sequences <- split(v1, cumsum(c(TRUE, diff(v1) <= 0)))
v2.sequences <- split(v2, cumsum(c(TRUE, diff(v2) <= 0)))

align.fun <- function(s1, s2) { #aligns two sequences
  s12 <- sort(unique(c(s1, s2)))
  cbind(ifelse(s12 %in% s1, s12, NA),
        ifelse(s12 %in% s2, s12, NA))
}

do.call(rbind, mapply(align.fun, v1.sequences, v2.sequences))
#       [,1] [,2]
#  [1,]    1    1
#  [2,]    2    2
#  [3,]    3    3
#  [4,]    4   NA
#  [5,]    1    1
#  [6,]    2    2
#  [7,]   NA    3
#  [8,]   NA    4
#  [9,]    5    5