1
votes

I have three large sets of vectors: A, B1 and B2. These sets are stored in files on disk. For each vector a from A I need to check whether it may be presented as a = b1 + b2, where b1 is from B1 and b2 is from B2. Vectors have 20 components, and all components are non-negative numbers.

How I'm solving this problem now (pseudocode):

foreach a in A  
  foreach b1 in B1
    for i = 1 to 20
      bt[i] = a[i] - b1[i]
      if bt[i] < 0 then try next b1
    next i

    foreach b2 in B2
      for i = 1 to 20
        if bt[i] != b2[i] then try next b2
      next i

      num_of_expansions++
    next b2
  next b1
next a

My questions:
1. Any ideas on how to make if faster?
2. How to make it in parallel?
3. Questions 1, 2 for the case when I have B1, B2, ..., Bk, k > 2?

1
What about hashing? This isn't fully thought out, but i'm thinking of this: (1) hash all combinations of b1 + b2 and put the sum, b1, and b2 in the hashtable. Then check each element of A. If it's in the table, then grab the b1+b2 combination. Don't know what you'll do if you need the index...hash that?Matthew
How big are the sets? Can you read all of them once from the disk and then keep them in memory?svick
@mattkc: thanks, this is something to consider. @svick: set A has millions of vectors, each of B's contains hundred thousand vectors. Yes, I can load them into memory and thus make the search faster.Leo

1 Answers

1
votes

You can sort B1 and B2 by norm. If a = b1 + b2, then ||a|| = ||b1 + b2|| <= ||b1|| + ||b2||, so for any a and b1, you can efficiently eliminate all elements of B2 that have norm < ||a|| - ||b1||. There may also be some way to use the distribution of norms in B1 and B2 to decide whether to switch the roles of the two sets in this. (I don't see off-hand how to do it, but it seems to me that something like this should hold if the distributions of norms in B1 and B2 are significantly different.)

As for making it parallel, it seems that each loop can be turned into a parallel computation, since all computations of one inner iteration are independent of all other iterations.

EDIT

Continuing the analysis: since b2 = a - b1, we also have ||b2|| <= ||a|| + ||b1||. So for any given a and b1, you can restrict the search in B2 to those elements with norms in the range ||a|| ± ||b1||. This suggests that for B1 you should select the set with the smallest average norm.