0
votes

I have two matrices: an index matrix (ncol=1, nrow=20,000) storing the values that I want to search for, and a data matrix (ncol=1, nrow=5,000) storing a large dataset.

index matrix: (water, meat, gas.... are row names)

water    DFAFADFADF
meat     QEREQRQTQTQ
gas      FEQQFQEFQF
.
.
.
..

data matrix: (Tom, Luis, Jerry, Vincent, Richard... are row names)

Tom              dfqfqfAFADFADaveffefd
Luis             eqeqfqefAFADFADuouojoimoij
Jerry            dafadfe3321AFADFADfdeff
Vincent          e31413413qeffffff
Richard          121eefq3ffAFADFADfffqffqff
.
.
.
..

For each value in the index matrix, I want to find which row(s) of the data matrix contain that string, record the matching row names, and put them in additional columns (or in a single column, separated by ",") of that string's row in the index matrix.

For example, I want a loop that first takes the value "DFAFADFADF" from the index matrix and searches for the rows of the data matrix that contain this string. Say it finds that Tom, Luis, Jerry, and Richard contain it; I then update the index matrix to be

index matrix:

water    DFAFADFADF    Tom, Luis, Jerry, Richard
meat     QEREQRQTQTQ
gas      FEQQFQEFQF
.
.
.
..

Then I take the next value in the index matrix, QEREQRQTQTQ, search the data matrix again, and update the index matrix again, until I have finished the last row of the index matrix.

Can anyone help with a loop? I guess we need a loop using for (...), but I don't know how to write it.

2
You may use %in% or match. - akrun
Can you please give an example? Many thanks! I am a complete beginner at this... - Qing Wang

2 Answers

0
votes
# Toy data: one-column data frames whose row names identify each entry
index <- data.frame(one = c("ABC", "DEF", "GHI", "JKL"))
rownames(index) <- c("water", "meat", "fruit", "bread")
data <- data.frame(one = c("ABCDEF", "DEFZMN", "MNOABC", "ZXCJKL"))
rownames(data) <- c("Tom", "Jerry", "Rob", "Nate")

# Long-format table of (data row, pattern, index row) matches
results <- data.frame()
for (r in seq_len(nrow(index))) {
    # Row names of the data entries that contain the r-th pattern
    index$results[r] <- list(rownames(data)[grep(index$one[r], data$one, ignore.case = TRUE)])
    count <- length(unlist(index$results[r]))
    df <- data.frame(data_match = unlist(index$results[r]),
                     pattern = rep(index$one[r], times = count),
                     index_match = rep(rownames(index)[r], times = count))
    results <- rbind(results, df)
}
# Wide table: one row per index entry, one column per matching data row
reshape2::dcast(results, index_match ~ data_match)

This generates a list column, index$results, so you might need to call unlist() on it depending on how you want to handle that information downstream. Also, R has named vectors, so where you have a one-column data frame you might only need a named character vector, like this:

index <- c("ABC", "DEF", "GHI", "JKL")
names(index) <- c("water", "meat", "fruit", "bread")

That might make the matching simpler next time.
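As a rough sketch of the named-vector idea, the lookup could be written without an explicit loop. The toy values below are illustrative only, and unlike the loop above this version is case-sensitive and treats each pattern as a literal string (fixed = TRUE), which is an assumption about what the question needs:

```r
# Hypothetical toy data, stored as named character vectors
index <- c(water = "ABC", meat = "DEF", fruit = "GHI", bread = "JKL")
data  <- c(Tom = "ABCDEF", Jerry = "DEFZMN", Rob = "MNOABC", Nate = "ZXCJKL")

# For each pattern, collect the names of the data entries containing it
# and collapse them into one comma-separated string.
matches <- vapply(index, function(p) {
  paste(names(data)[grepl(p, data, fixed = TRUE)], collapse = ", ")
}, character(1))

# One row per index entry; entries with no match get an empty string
result <- data.frame(pattern = index, matches = matches)
print(result)
```

This gives the single comma-separated column the question asks for; patterns with no match end up with an empty string.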

0
votes

Shorter solution:

# Note: %in% tests whole-string equality, not substring containment
row.names(data)[apply(data, 1, function(x) {
  sapply(x, function(y) y %in% c("DFAFADFADF", "QEREQRQTQTQ", "FEQQFQEFQF"))
})]
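
Because %in% compares whole strings, a substring version of the same idea might look like the sketch below. It swaps in grepl() with fixed = TRUE; the patterns and data here are hypothetical toy values, not the ones from the question:

```r
# Hypothetical search strings and one-column data frame
patterns <- c("ABC", "JKL")
data <- data.frame(one = c("ABCDEF", "DEFZMN", "MNOABC", "ZXCJKL"),
                   row.names = c("Tom", "Jerry", "Rob", "Nate"),
                   stringsAsFactors = FALSE)

# One logical vector per pattern, combined with OR:
# TRUE where a data row contains at least one pattern as a substring
hit <- Reduce(`|`, lapply(patterns, grepl, x = data$one, fixed = TRUE))

found <- row.names(data)[hit]
print(found)
```

Note that this reports which data rows contain any of the patterns; it does not say which pattern matched which row.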