1
votes

I wish to extract "pair of numbers", i.e. numbers in adjacent columns within the same row. Then I want to count the pairs to determine which are most frequent.

As example I created a dataset with 5 columns and 4 rows:

var1 var2 var3 var4 var5
   1    2    3    0   11
   2    0    3    0    1
   3    0    3    1    2
   4    1    2    2   11

The most frequent consecutive pairs of number are:

1 -> 2: 3 times (row 1, var1 -> var2; row 3, var4 -> var5; row 4, var2 -> var3)

3 -> 0: 3 times (row 1, var3 -> var4; row 2, var3 -> var4; row 3, var1 -> var2)

0 -> 3: 2 times

I am struggling with the code that identifies the most frequent 'consecutive pair of numbers'?

How can I replace the identified consecutive pair of number with 2 and the others with 0?

2

2 Answers

0
votes

A base alternative.

1. Find and count pairs

Because you only have numerical values, we coerce data to matrix. This will make subsequent calculations considerably faster. Create lag and lead versions (column-wise) of the data, i.e. remove the last column (m[ , -ncol(m)]) and first column (m[ , -ncol(m)]) respectively.

Coerce the lag and lead data to 'from' and 'to' vectors, and count pairs (table). Convert table to matrix. Select first pair with max frequency.

m <- as.matrix(d)
tt <- table(from = as.vector(m[ , -ncol(m)]), to = as.vector(m[ , -1]))
m2 <- cbind(from = as.integer(dimnames(tt)[[1]]),
            to = rep(as.integer(dimnames(tt)[[2]]), each = dim(tt)[1]),
            freq = as.vector(tt))      
m3 <- m2[which.max(m2[ , "freq"]), ]
# from   to freq 
#    3    0    3

If you want all pairs with maximum frequency, use m2[m2[ , "freq"] == max(m2[ , "freq"]), ] instead.


2. Replace values of most frequent pair and set rest to zero

Make a copy of the original data. Fill it with zero. Grab the 'from' and 'to' values of the 'max pair'. Get indexes of matches in lag and lead data, which correspond to 'from' columns. rbind with indexes of 'to' columns. At the indexes, replace zeros with 2.

m_bin <- m
m_bin[] <- 0
ix <- which(m[ , -ncol(m)] == m3["from"] &
              m[ , -1] == m3["to"],
            arr.ind = TRUE)
m_bin[rbind(ix, cbind(ix[ , "row"], ix[ , "col"] + 1))] <- 2
m_bin
#      var1 var2 var3 var4 var5
# [1,]    0    0    2    2    0
# [2,]    0    0    2    2    0
# [3,]    2    2    0    0    0
# [4,]    0    0    0    0    0

3. Benchmark

I use data of a size somewhat similar to that mentioned by OP in comment: a data frame with 10000 rows, 100 columns, and sampling from 100 different values.

I compare the code above (f_m()) with the zoo answer (f_zoo(); functions below). To compare the output, I add dimnames to the zoo result.

The result shows that f_m is considerably faster.

set.seed(1)
nr <- 10000
nc <- 100
d <- as.data.frame(matrix(sample(1:100, nr * nc, replace = TRUE),
                          nrow = nr, ncol = nc))

res_f_m <- f_m(d)
res_f_zoo <- f_zoo(d)
dimnames(res_f_zoo) <- dimnames(res_f_m)

all.equal(res_f_m, res_f_zoo)
# [1] TRUE

system.time(f_m(d))
# user  system elapsed 
# 0.12    0.01    0.14 

system.time(f_zoo(d))
# user  system elapsed 
# 61.58   26.72   88.45

f_m <- function(d){
  m <- as.matrix(d)
  tt <- table(from = as.vector(m[ , -ncol(m)]),
              to = as.vector(m[ , -1]))
  m2 <- cbind(from = as.integer(dimnames(tt)[[1]]),
              to = rep(as.integer(dimnames(tt)[[2]]),
                       each = dim(tt)[1]),
              freq = as.vector(tt))

  m3 <- m2[which.max(m2[ , "freq"]), ]
  m_bin <- m
  m_bin[] <- 0
  ix <- which(m[ , -ncol(m)] == m3["from"] &
                m[ , -1] == m3["to"],
              arr.ind = TRUE)
  m_bin[rbind(ix, cbind(ix[ , "row"], ix[ , "col"] + 1))] <- 2
  return(m_bin)
}


f_zoo <- function(d){
  pairs <- sort(table(c(rollapply(t(d), 2, toString))))
  top <- scan(text = names(tail(pairs, 1)), sep = ",", what = 0L, quiet = TRUE)
  right <- rollapplyr(unname(t(d)), 2, identical, top, fill = FALSE)
  left <- rbind(right[-1, ], FALSE)
  t(2 * (left | right))
  }
1
votes
library(zoo)
pairs <- sort(table(c(rollapply(t(DF), 2, toString))))

# all pairs with their frequency
pairs
##  0, 1 0, 11  2, 0 2, 11  2, 2  2, 3  3, 1  4, 1  0, 3  1, 2  3, 0 
##     1     1     1     1     1     1     1     1     2     3     3 

# same but as data.frame
data.frame(read.table(text = names(pairs), sep = ","), freq = c(pairs))
##       V1 V2 freq
## 0, 1   0  1    1
## 0, 11  0 11    1
## ...
## 0, 3   0  3    2
## 1, 2   1  2    3
## 3, 0   3  0    3

# pair with highest frequency - picks one if there are several
tail(pairs, 1)
## 3, 0 
##    3 

# all pairs with highest frequency
pairs[pairs == max(pairs)]
## 1, 2 3, 0 
##    3    3 

To replace all 3,0 pairs with 2 and everything else with 0:

top <- scan(text = names(tail(pairs, 1)), sep = ",", what = 0L, quiet = TRUE)
right <- rollapplyr(unname(t(DF)), 2, identical, top, fill = FALSE)
left <- rbind(right[-1, ], FALSE)
t(2 * (left | right))
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    0    0    2    2    0
## [2,]    0    0    2    2    0
## [3,]    2    2    0    0    0
## [4,]    0    0    0    0    0

Note

The input DF in reproducible form is:

Lines <- "1     2     3   0    11
2     0     3   0     1
3     0     3   1     2
4     1     2   2     11"
DF <- read.table(text = Lines)