
Is there a more elegant way to work out the maximum or minimum level of variability (CV) in the marginal column totals of a binary matrix based on its fill and size, given that all row and column totals must be non-zero? e.g.

foo(n_col, n_row, fill){ get maximum possible CV }

Let's say we have a matrix called m where all column and row totals are > 0 but the matrix is minimally filled.

m <- matrix(rep(0,25), nrow = 5)
diag(m) <- 1
#     [,1] [,2] [,3] [,4] [,5]
#[1,]    1    0    0    0    0
#[2,]    0    1    0    0    0
#[3,]    0    0    1    0    0
#[4,]    0    0    0    1    0
#[5,]    0    0    0    0    1

variability1 <- sd(colSums(m))/mean(colSums(m))
variability1
# [1] 0
# the maximum and minimum CV for this fill are zero,
# given that all column and row totals must be > 0

Perhaps we could check the maximum at increasing levels of fill like:

# find out which matrix elements are zeros
# (which() returns linear indices in column-major order, so we fill column-wise)
empty <- which(m < 1)
# vector for results
variability <- rep(NA, length(empty))
for (i in seq_along(empty)) {
  m[empty[i]] <- 1
  variability[i] <- sd(colSums(m))/mean(colSums(m))
}
# we get what should be the maximum CV for each given level of matrix fill...
c(variability1, variability)

I think filling the matrix column-wise like this maintains the maximum variability in the marginal column totals? Is there a simpler way to work this out for the maximum and minimum variability of matrices of different sizes, fills and shapes?


1 Answer


The following provides an alternative formulation of the problem as an optimization over the choice of the vector of column sums of a binary matrix that maximizes the variability for a given fill. Informal arguments for the validity of this formulation and of the resulting algorithm to solve it are provided. The resulting algorithm is consistent with the OP's assertion that

filling the matrix column-wise like this maintains the maximum variability in the marginal column totals

First, define fill to be the number of 1's in the n_row by n_col binary matrix m. Given the constraints of the problem statement that m is a binary matrix with all row and column sums greater than zero, fill is an integer in the range [max(n_row, n_col), n_row*n_col].

The problem is then: for a given value of fill in the range [max(n_row, n_col), n_row*n_col], find the maximum of

 sd(colSums(m))/mean(colSums(m))

over all m such that m is a binary matrix with fill number of 1's and with all row and column sums greater than zero.

We note that it is better to specify the domain of this optimization problem in terms of the vector of column sums of m rather than m itself. This is because there exist different m's with the same vector of column sums and therefore the same objective value. Denoting the vector of column sums as x, the above optimization problem can be restated as one of maximizing:

sd(x)/mean(x)

such that each element of x is an integer in the range [1, n_row] and sum(x) is fill.

Furthermore, since sum(x) is constrained to be equal to fill, the denominator term mean(x) is constant over all x for a given fill. Consequently, an equivalent objective function to maximize is simply sd(x) or equivalently the variance of x.
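As a small illustration of this reduction (my own example, not part of the original answer), two different binary matrices with the same vector of column sums have the same CV, and for a fixed fill the mean of the column sums is fixed at fill/n_col:

## two distinct 3x3 binary matrices with identical column sums (fill = 5);
## all row and column sums are > 0 in both
m1 <- rbind(c(1, 1, 0),
            c(1, 0, 1),
            c(1, 0, 0))
m2 <- rbind(c(1, 0, 1),
            c(1, 1, 0),
            c(1, 0, 0))
colSums(m1)                              # 3 1 1
colSums(m2)                              # 3 1 1
sd(colSums(m1))/mean(colSums(m1))        # same CV for both matrices
sd(colSums(m2))/mean(colSums(m2))
mean(colSums(m1)) == sum(m1)/ncol(m1)    # mean(x) = fill/n_col, fixed by fill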

To maximize the variance of x, we need to choose x such that the differences between its values are as large as possible while still satisfying the constraints on x. We can think about this problem inductively with respect to fill. Assume that for a given fill we have the solution for x that maximizes the variance of x while satisfying its constraints. The question becomes: when we increment fill to fill + 1, what is the new x that maximizes the variance? Because sum(x) = fill and each element of x is an integer, incrementing fill implies that we must increment one and only one element of x.

For the moment, relax the upper limit constraint on each element of x (i.e., x[i] <= n_row for all i in [1,n_col]). The question then becomes: which element of x should be incremented to maximize the increase in the variance of x? To answer this, we can look at the Taylor series expansion of var(x):

var(x + dx) = var(x) + gradient(var(x)) %*% dx + 1/2 * t(dx) %*% Hessian(var(x)) %*% dx

where dx is a vector of length n_col with one element equal to 1 and all other elements 0 (i.e., an indicator vector). Since var(x) is quadratic in x, this second order expansion is exact. Furthermore, since dx is an indicator vector, only the diagonal elements of the Hessian matter. The gradient and the Hessian diagonal are given by:

gradient(var(x))[i] = 2*(x[i]-mean(x))/(n_col-1),      for all i in [1,n_col]
Hessian(var(x))[i,i] = 2/n_col                  ,      for all i in [1,n_col]
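As a quick numeric sanity check (my own sketch, not part of the original answer): because var(x) is quadratic, the expansion is exact for a unit increment dx, so var(x + dx) - var(x) should equal the gradient term plus half the Hessian diagonal term:

x <- c(4, 2, 1)
n_col <- length(x)
i <- 1
dx <- replace(numeric(n_col), i, 1)          # indicator vector for element i
var(x + dx) - var(x)                         # actual change in the variance
2*(x[i] - mean(x))/(n_col - 1) + 1/n_col     # gradient term + 1/2 * Hessian term
## both equal 2; the gradient term grows with x[i], so incrementing the
## largest element gives the largest increase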

Since all the diagonal terms of the Hessian are the same, the second order term of the Taylor series is the same for any choice of dx. Consequently, only the first order term matters in determining which element of x to increment to maximize the increase in the variance of x. From the gradient terms, it is clear that we should increment the element of x with the largest current value x[i]. Now, we reintroduce the upper limit constraint on each element of x. Then, the optimal choice is to increment the element of x with the largest current value x[i] < n_row. Note that if there are multiple elements of x with the same maximum value x[i] < n_row, then choosing any one of them results in the same maximal increase in the variance of x.
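A minimal sketch of this incremental rule (the function name is my own, not from the answer): among the elements of x that are still below n_row, increment one with the largest current value.

increment_best <- function(x, n_row) {
  candidates <- which(x < n_row)              # elements that can still grow
  i <- candidates[which.max(x[candidates])]   # largest current value among them
  x[i] <- x[i] + 1
  x
}
x <- rep(1, 5)                     # column sums at the minimal fill, n_col = n_row = 5
x <- increment_best(x, n_row = 5)  # fill = 6: 2 1 1 1 1
x <- increment_best(x, n_row = 5)  # fill = 7: 3 1 1 1 1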

What we have shown so far is that given a fill and the solution for x that maximizes the variance of x while satisfying its constraints, we have a rule dx that maximizes the incremental increase in the variance of x for fill + 1. It remains to show that this rule results in a new x that is the optimal x that maximizes the variance of x for the new fill + 1. We now show this by contradiction. Specifically, if this new x does not maximize the variance of x for fill + 1, then there must exist another vector of column sums x_1 for fill and a different rule dx_1 such that

var(x_1 + dx_1) > var(x + dx)

However, since x maximizes var(x) for fill and the equations for the gradient and Hessian hold for any x, we have:

var(x_1 + dx_1) = var(x_1) + gradient(var(x_1)) %*% dx_1 + 1/2 * t(dx_1) %*% Hessian(var(x_1)) %*% dx_1
                <= var(x_1) + 2*(max(x_1)-mean(x_1))/(n_col-1) + constant
                <= var(x) + 2*(max(x)-mean(x))/(n_col-1) + constant
                = var(x + dx)

and hence the contradiction. To explain the steps more clearly:

  1. From the first line to the second, the best choice for dx_1 at x_1 is the one that increments the maximum element of x_1, hence gradient(var(x_1)) %*% dx_1 <= 2*(max(x_1)-mean(x_1))/(n_col-1). Also, the second order term is the same constant for all x and dx at a given fill, so we simply write it as a constant.
  2. From the second line to the third, we have (i) var(x_1) <= var(x) by our assumption that x maximizes the variance at fill, (ii) gradient(var(x)) %*% dx = 2*(max(x)-mean(x))/(n_col-1) for the optimal rule dx at fill, and (iii) max(x_1) <= max(x) given that x maximizes the variance at fill. To see the latter, consider a common predecessor vector of column sums x_-1 at fill-1. The difference between incrementing x_-1 to x and incrementing it to x_1 is simply a different choice of which element of x_-1 to increment. From the Taylor series expansion of the variance at x_-1, the element incremented to reach x must be greater than or equal to the element incremented to reach x_1, because var(x_1) <= var(x). Therefore, max(x_1) <= max(x). Now extend this reasoning to any common vector of column sums at some previous fill-k >= max(n_row, n_col), including the initial fill where the vector of column sums is all 1's. Then, for one path choose the optimal incremental rules defined above to reach x, while for the other path choose an arbitrary sequence of incremental rules to reach x_1. Since the optimal rule always increments the largest element of the state at each step (subject to the upper limit), it again follows that max(x_1) <= max(x).

Finally, to complete the mathematical induction, we start with the initial fill where x is all 1's. This trivially optimizes var(x) since there are no other choices for x given this initial fill. Now, the optimal incremental rule dx is to choose the first element of x to increment, since all elements are equal. The resulting x + dx trivially maximizes the variance for the initial fill plus one since incrementing any other element of x will result in the same variance.
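For small matrices the induction can also be checked numerically (again, a sketch of my own): repeatedly apply the incremental rule starting from x = rep(1, n_col) and verify at each fill that no admissible vector of column sums achieves a larger variance.

n_col <- 4; n_row <- 3
## all admissible column-sum vectors: each element an integer in 1..n_row
grid <- as.matrix(expand.grid(rep(list(seq_len(n_row)), n_col)))
x <- rep(1, n_col)
ok <- TRUE
for (fill in (n_col + 1):(n_col*n_row)) {
  ## incremental rule: bump the largest element that is still below n_row
  cand <- which(x < n_row)
  i <- cand[which.max(x[cand])]
  x[i] <- x[i] + 1
  ## brute-force maximum variance over all admissible vectors with this fill
  best <- max(apply(grid[rowSums(grid) == fill, , drop = FALSE], 1, var))
  ok <- ok && isTRUE(all.equal(var(x), best))
}
ok
## expect TRUE for this small case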

The above arguments immediately suggest the following algorithm to distribute a value of fill across the vector of column sums:

  1. Loop through each element in the vector of column sums x.
  2. For the i-th element, set x[i] <- min(n_row, fill - (n_col-i)). Note that we subtract (n_col-i) from fill so that we reserve enough to fill the remaining elements of the column sums vector with at least 1, and we cap the amount at n_row to satisfy the constraints of the problem.
  3. Update fill <- fill - x[i]

This algorithm and the associated arguments validate the OP's assertion that

filling the matrix column-wise like this maintains the maximum variability in the marginal column totals

In R, the code looks like:

foo <- function(n_col, n_row, fill) {
  ## preallocate the vector of column sums x and initialize to NA
  x <- rep(NA, n_col)
  for (i in seq_len(n_col)) {
    x[i] <- pmin.int(n_row, fill-(n_col-i))
    fill <- fill - x[i]
  }
  ## compute the variability given the vector of column sums x
  sd(x)/mean(x)
}

Recognizing that the repeated decrementation of fill in a loop can be replaced by a cumsum, the above simplifies to:

foo <- function(n_col, n_row, fill) {
  x <- pmin.int(pmax.int(cumsum(c(fill-n_col+1,rep(-n_row+1,n_col-1))),1),n_row)
  ## compute the variability given the vector of column sums x
  sd(x)/mean(x)
}

Using this function, we recover the OP's result:

n_col=5
n_row=5
variability <- sapply(max(n_col,n_row):(n_col*n_row), function(fill) foo(n_col, n_row, fill))
print(variability)
## [1] 0.0000000 0.3726780 0.6388766 0.8385255 0.9938080 0.8660254 0.8131156 0.8122329 0.8426501
##[10] 0.7319251 0.6666667 0.6404344 0.6443795 0.5414886 0.4707512 0.4330127 0.4259177 0.3049184
##[19] 0.1944407 0.0931695 0.0000000
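As a final consistency check (my own addition), one can verify that the vectorized cumsum formulation allocates exactly the same column sums as the loop version for every admissible fill in the 5 by 5 case:

n_col <- 5; n_row <- 5
same <- sapply(max(n_col, n_row):(n_col*n_row), function(fill) {
  ## loop allocation, as in the first version of foo
  x_loop <- rep(NA, n_col)
  f <- fill
  for (i in seq_len(n_col)) {
    x_loop[i] <- min(n_row, f - (n_col - i))
    f <- f - x_loop[i]
  }
  ## cumsum allocation, as in the second version of foo
  x_cumsum <- pmin.int(pmax.int(cumsum(c(fill-n_col+1, rep(-n_row+1, n_col-1))), 1), n_row)
  isTRUE(all.equal(as.numeric(x_loop), as.numeric(x_cumsum)))
})
all(same)
## expect TRUE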