2 votes

Imagine you have two histograms with an equal number of bins. N observations are distributed among the bins. Each bin now has between 0 and N observations.

What algorithm would be appropriate for determining the minimum number of observations to remove from both histograms in order to make them proportional? They do not need to be equal in absolute number, only proportional to each other. That is, there must be a common factor by which all the bins in one histogram can be multiplied in order to make it equal to the other histogram.

For example, imagine the following two histograms, where the item i in each histogram refers to the number of observations in bin i for the respective histogram.

Histogram 1: 4, 7, 4, 9
Histogram 2: 2, 0, 2, 1

For these histograms, the solution would be to remove from histogram 1 all 7 observations in bin 2 and another 7 observations from bin 4, leaving histogram 1 as 4, 0, 4, 2, such that (histogram 2) * 2 = histogram 1.
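To sanity-check the example, here is a quick Python sketch of the arithmetic (variable names are illustrative only):

```python
h1 = [4, 7, 4, 9]
h2 = [2, 0, 2, 1]

# Remove all 7 observations from bin 2 and 7 observations
# from bin 4 of histogram 1.
h1_kept = [4, 0, 4, 2]

# Histogram 2 scaled by 2 now equals the reduced histogram 1.
assert [2 * b for b in h2] == h1_kept

# Total observations removed:
print(sum(h1) - sum(h1_kept))  # 14
```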

But what general algorithm could be used to find the subsets of the two histograms that maximize the total number of retained observations while making them proportional? You can drop observations from both histograms or from just one.

Thanks!

“Distribute N observations between all the bins” does not mean the same thing as “N observations are distributed among the bins”. If you are allowed to distribute them as you like initially, then no N exists that requires removing any observations. (Put 1 observation in cell 1 of histo 1, and N-1 in cell 1 of histo 2.) – James Waldby - jwpat7
Good point – it has been reworded to reflect the intent of the question. – Henry David Thorough
Sounds like an NP-hard problem. Approximately how many bins will you typically be dealing with, and about how large is N? – user2566092
Also, is the number of samples N equal between the two histograms? In your example the two histograms had different numbers of samples. – user2566092
@user2566092 Surely not strongly NP-hard – things get a lot easier once the number of items in each (proportional) histogram is known (quadratically many possibilities if the histograms are given in unary). – David Eisenstat

1 Answer

0 votes

Seems to me that the problem is equivalent (if you consider each histogram as an M-dimensional vector, where M is the number of bins) to minimizing the Manhattan length |R|, where R = x*A - B, A and B are your 'vectors', and x is your proportional scale.

|R| is convex (piecewise linear) in x, so it has a single minimum (not necessarily at an integer x), which you can find fairly rapidly using a simple bisection algorithm (or something akin to Newton's method).

Then, assuming you want a solution where the proportion is an integer, test the two candidates ceil(x) and floor(x) to find which gives the smaller Manhattan length (that length is the number of observations you need to remove).
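The answer gives no code, but a minimal Python sketch of this approach might look like the following. A ternary search is used in place of bisection, since it applies directly to convex piecewise-linear functions; the histograms and the search bounds are illustrative assumptions:

```python
import math

def manhattan(x, A, B):
    """Manhattan length of R = x*A - B."""
    return sum(abs(x * a - b) for a, b in zip(A, B))

def best_scale(A, B, lo=0.0, hi=100.0, iters=200):
    """Ternary search for the real x minimizing |x*A - B|.
    Valid because the objective is convex in x."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if manhattan(m1, A, B) <= manhattan(m2, A, B):
            hi = m2
        else:
            lo = m1
    return (lo + hi) / 2

# The question's example: scale histogram 2 to match histogram 1.
A = [2, 0, 2, 1]   # histogram 2
B = [4, 7, 4, 9]   # histogram 1
x = best_scale(A, B)

# Test the two integer candidates around the real-valued optimum.
best = min((math.floor(x), math.ceil(x)), key=lambda s: manhattan(s, A, B))
print(best, manhattan(best, A, B))  # 2 14
```

Here the minimizing scale is 2 and the Manhattan length at that scale is 14, matching the 7 + 7 removals in the question's example.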

Proof that the problem is not NP-hard:

Consider an inefficient 'solution' whereby you remove all N observations from all the bins. Now both A and B equal the zero histogram 0 = (0, 0, 0, ...). The two histograms are then equal and thus proportional, since 0 = s * 0 for every scale s, so a hard upper bound on the number of observations to remove is N.

Now assume a more efficient solution exists with additions/removals < N and a proportional scale s > 2*N (i.e. after removal of some observations, A = s * B or B = s * A). If both A = 0 and B = 0, we have the previous solution with N removals (which contradicts the assumption that there are fewer than N removals). If A = 0 and B ≠ 0, then there is no s ≠ 0 such that 0 = s * B and no s such that s * 0 = B (with a similar argument for B = 0 and A ≠ 0). So it must be the case that both A ≠ 0 and B ≠ 0. Assume for a moment that A is the histogram to be scaled (so A * s = B). A must have at least one non-zero entry A[i], with minimum value 1 (after removal of extra observations), so when scaled that entry has value at least s > 2*N. Therefore the corresponding entry B[i] must also have more than 2*N observations. But the total number of observations was initially N, so we would have needed to add more than N observations to B[i], which contradicts the assumption that the improved solution had fewer than N additions/removals. So no 'efficient' solution requires a proportional scale greater than 2*N.

So finding an efficient solution requires, at worst, testing the 'best fit' solution for integer scaling factors in the range 0 to 2*N.

The 'best fit' solution for scaling factor s in A = s * B, where A and B have M bins each, requires

Sum(i=1 to M) of { Abs(A[i] - s * B[i]) mod s + Abs(A[i] - s * B[i]) div s }

additions/removals.

This is an O(M) operation, so testing every scaling factor in the range 0 to 2*N gives an algorithm of order O(M*N).
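As a sketch, the scan over integer scale factors might look like this in Python. Note that the per-bin cost here is computed by a direct minimization (keep B'[i] = min(B[i], A[i] div s) in the histogram being scaled up), since the mod/div split above only applies when A[i] ≤ s * B[i]; the function names, the scan bound, and trying both scaling directions are illustrative assumptions, not taken verbatim from the answer:

```python
def min_removals(A, B, s):
    """Removals needed so that A' = s * B' bin-by-bin (s a positive
    integer), where A' and B' are obtained from A and B by removing
    observations. Per bin, keeping B'[i] = min(B[i], A[i] // s)
    maximizes the kept observations."""
    total = 0
    for a, b in zip(A, B):
        kept_b = min(b, a // s)
        total += (a - s * kept_b) + (b - kept_b)
    return total

def best_fit(A, B):
    """Scan integer scales, trying both scaling directions.
    Returns (min removals, scale, which histogram is the larger one)."""
    n = max(sum(A), sum(B))
    best = (sum(A) + sum(B), 0, 'A')  # worst case: empty both histograms
    for s in range(1, n + 1):
        best = min(best, (min_removals(A, B, s), s, 'A'))  # A = s * B
        best = min(best, (min_removals(B, A, s), s, 'B'))  # B = s * A
    return best

h1 = [4, 7, 4, 9]
h2 = [2, 0, 2, 1]
print(best_fit(h1, h2))  # (14, 2, 'A'): remove 14 observations, scale 2
```

For the question's example this reproduces the hand-derived answer: 14 removals at scale factor 2.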

I am fairly certain (though I don't have a formal proof) that the scale factor cannot exceed the number of observations in the fullest bin. In practice it is typically much smaller. For two histograms with two hundred bins and 30-300 randomly chosen observations per bin, with Na > Nb total observations in all the bins of A and B respectively, the scaling factor was almost always found in the range Na/Nb - 4 < s < Na/Nb + 4 (or s = 0 if Na >> Nb).