0
votes

Solved this ages ago, but this question recently got some attention so I'm adding my solution (and some clarification). See below.


I have a list of 32 bit integers, and I want to find a subset of them for which ((item ^ bits) & mask) == 0 is true. The problem is, bits and mask can both take any value (mask is biased towards having not very many bits set though), so I can't simply pre-calculate all possible combinations.

The list is not very big, typically around 500 items, so things that look good in theory (such as a binary tree, where for each bit in the mask a whole subtree can be skipped) are actually slow in practice. Even a tree with just two levels was just a bit slower than the naive approach, despite skipping a significant number of tests.

Currently I iterate over the entire list and test every item. It's fast once, but it happens millions of times, every time with different bits and mask so caching the result doesn't help. This part of the program takes just over 40% of the total CPU time it uses.

foreach (var row in validRows.Keys)
{
    // this single line here takes 40% of the total program time, according to ANTS 5
    if (((oldrow ^ row) & oldused) == 0)
    // the other magic takes no significant time, according to ANTS
    {
        if (y > 1 && ((((row ^ prev) | yminone) + 1) >> rows.Length) == 0)
        {
            continue;
        }
        if (dict.ContainsKey(row))
            continue;

        dict.Add(row, true);

        rows[y] = row;
        count(y + 1, dict);   // this is a recursive call.

        dict.Remove(row);
    }
}

I gathered some statistics. It turns out that over 130000 of the 179000 queries return only 1 item. That sounds like an opportunity for some kind of optimization to me, but I'm not sure how or what.


For this particular sub-problem, preprocessing helped a lot. I now make an array of possibilities for every row in the input, which is just validRows (now an array instead of a dictionary) filtered by (int row) => ((inputrow ^ row) & inputfilledmask) == 0.

The actual problem is, given a partially filled 8x8 boolean matrix, to count all assignments that satisfy the following rules:

  1. 4 ones and 4 zeroes in every row
  2. 4 ones and 4 zeroes in every column
  3. No more than 2 ones next to each other vertically or horizontally
  4. No more than 2 zeroes next to each other vertically or horizontally
  5. No two rows may be equal
  6. No two columns may be equal

Here's how I solve it now:

For every row, filter the list of the 34 valid rows down to the ones which could be assigned on top of it (ie all the bits that correspond to ones in the mask that marks the filled cells are equal in the input row and the row that could be assigned on top of it).

Then recursively, fill in the next row with every possible row at that point. That means it must be in the filtered list, it must not have been used yet (ie not in the hash set), and a test is done to ensure that it doesn't violate the 3rd of 4th rules. To prune some more sub-trees of the recursion tree, I also use two extra integers, one to keep track of the number of zeroes in every column (one nibble per column), and one to keep track of the ones. Potential rows that would violate the second rule are ignored by a concise test. These integers are initialized corresponding to the input (plus 0x33333333, so that 4 ones in a column don't set the highest bit of the nibble but 5 or more do), and updated only with cells that were previously empty.

Finally, at the bottom of the recursion tree, the last row is completely dictated by the two integers that count the ones and zeroes in the columns (even just one of them is enough to determine the last row). Then a test is done for duplicate columns - that's the only thing that isn't automatically guaranteed by construction. In all, the time went down from about a minute to about a tenth of a second (usually less).

I'm still open to suggestions (although that would make this a completely different question, really). This counting routine is used to generate "good" initial configurations by brute force - the faster it runs, the better the result can be in the allotted time.

2
What do you know about the values? For example, are the low bits likely to be uniformly distributed? What are the likely values for mask?Jon Skeet
It's hard to suggest any optimisations without knowing any characteristics of the data or the purpose of the calculations. Can you specify for example if you are masking out a single bit, a few bits or most of the bits? What is the calculation for? Does a general result include just a few of the numbers or most of the numbers? Handling the result (for example creating a List object and add numbers to it) could be more work than doing the calculations, so how do you want the result?Guffa
Are bits and mask 32-bit as well?FlyingStreudel
@Jon Skeet: mask is likely to have around zero to 5 bits set (though it can be more, it is not likely), the values never have more than 2 same bits in a row in the lower bits (ie never "000" or "1111"), but the upper bits (say the 16 upper most bits) can be all zero.harold
bits and mask are 32 bits as well. The result does not have to be in a list, I just iterate over them and use them one by one. On average, 17% of the items are returned for a query.harold

2 Answers

1
votes

I think that you are concentrating on the wrong part of the solution. Although a lot of the CPU time is in this loop, it can't be optimised much in itself.

I did a test with precalculated lists containing the integers where the bits were set or cleared for specific bits, and combining the lists based on the values in bits and mask. Although it was pretty fast, it's still enough overhead to make it take ten times longer than just calculating the values.

You have to look outside the loop at what the data actually means to find a way to elliminate some of the work.

0
votes

You are asking for a better data structure, but what if there isn't one?

You could consider taking a step back and looking at your problem to see if you can use multiple threads or parallel constructs that would enable you to use more than one processor at a time.