python - Longest equally-spaced subsequence

Question

I have a million integers in sorted order and I would like to find the longest subsequence where the difference between consecutive pairs is equal. For example

1, 4, 5, 7, 8, 12

has a subsequence

   4,       8, 12

My naive method is greedy and just checks how far you can extend a subsequence from each point. This takes O(n²) time per point it seems.

Is there a faster way to solve this problem?

Update. I will test the code given in the answers as soon as possible (thank you). However it is clear already that using n^2 memory will not work. So far there is no code that terminates with the input as [random.randint(0,100000) for r in xrange(200000)] .

Timings. I tested with the following input data on my 32 bit system.

a= [random.randint(0,10000) for r in xrange(20000)] 
a.sort()

The dynamic programming method of ZelluX uses 1.6G of RAM and takes 2 minutes and 14 seconds. With pypy it takes only 9 seconds! However it crashes with a memory error on large inputs.
The O(nd) time method of Armin took 9 seconds with pypy but only 20MB of RAM. Of course this would be much worse if the range were much larger. The low memory usage meant I could also test it with a= [random.randint(0,100000) for r in xrange(200000)] but it didn't finish in the few minutes I gave it with pypy.

In order to be able to test the method of Kluev's I reran with

a= [random.randint(0,40000) for r in xrange(28000)] 
a = list(set(a))
a.sort()

to make a list of length roughly 20000. All timings with pypy

ZelluX, 9 seconds
Kluev, 20 seconds
Armin, 52 seconds

It seems that if the ZelluX method could be made linear space it would be the clear winner.

I don't understand all the downvotes, and especially not the close votes (off-topic, better suited for Super User? Seriously?). It's an interesting problem that I'd like to see answers to, too. So, +1 from me. — Tim Pietzcker
@user2179021: You might improve your question by including the code you already have. That seems to calm some of the more critical SO users. Don't worry about the downvotes for now. — Tim Pietzcker
@TimPietzcker I'm with you here, I've seen much worser questions, I think it good one here — Roman Pekar
In the example, what decides that 4, 8, 12 is the correct output over 1, 4, 7 which is an equally long sequence? — FThompson

Evgeny Kluev Evgeny Kluev · Accepted Answer · 2013-08-11T10:08:49

Update: First algorithm described here is obsoleted by Armin Rigo's second answer, which is much simpler and more efficient. But both these methods have one disadvantage. They need many hours to find the result for one million integers. So I tried two more variants (see second half of this answer) where the range of input integers is assumed to be limited. Such limitation allows much faster algorithms. Also I tried to optimize Armin Rigo's code. See my benchmarking results at the end.

Here is an idea of algorithm using O(N) memory. Time complexity is O(N² log N), but may be decreased to O(N²).

Algorithm uses the following data structures:

prev: array of indexes pointing to previous element of (possibly incomplete) subsequence.
hash: hashmap with key = difference between consecutive pairs in subsequence and value = two other hashmaps. For these other hashmaps: key = starting/ending index of the subsequence, value = pair of (subsequence length, ending/starting index of the subsequence).
pq: priority queue for all possible "difference" values for subsequences stored in prev and hash.

Algorithm:

Initialize prev with indexes i-1. Update hash and pq to register all (incomplete) subsequences found on this step and their "differences".
Get (and remove) smallest "difference" from pq. Get corresponding record from hash and scan one of second-level hash maps. At this time all subsequences with given "difference" are complete. If second-level hash map contains subsequence length better than found so far, update the best result.
In the array prev: for each element of any sequence found on step #2, decrement index and update hash and possibly pq. While updating hash, we could perform one of the following operations: add a new subsequence of length 1, or grow some existing subsequence by 1, or merge two existing subsequences.
Remove hash map record found on step #2.
Continue from step #2 while pq is not empty.

This algorithm updates O(N) elements of prev O(N) times each. And each of these updates may require to add a new "difference" to pq. All this means time complexity of O(N² log N) if we use simple heap implementation for pq. To decrease it to O(N²) we might use more advanced priority queue implementations. Some of the possibilities are listed on this page: Priority Queues.

See corresponding Python code on Ideone. This code does not allow duplicate elements in the list. It is possible to fix this, but it would be a good optimization anyway to remove duplicates (and to find the longest subsequence beyond duplicates separately).

And the same code after a little optimization. Here search is terminated as soon as subsequence length multiplied by possible subsequence "difference" exceeds source list range.

Armin Rigo's code is simple and pretty efficient. But in some cases it does some extra computations that may be avoided. Search may be terminated as soon as subsequence length multiplied by possible subsequence "difference" exceeds source list range:

def findLESS(A):
  Aset = set(A)
  lmax = 2
  d = 1
  minStep = 0

  while (lmax - 1) * minStep <= A[-1] - A[0]:
    minStep = A[-1] - A[0] + 1
    for j, b in enumerate(A):
      if j+d < len(A):
        a = A[j+d]
        step = a - b
        minStep = min(minStep, step)
        if a + step in Aset and b - step not in Aset:
          c = a + step
          count = 3
          while c + step in Aset:
            c += step
            count += 1
          if count > lmax:
            lmax = count
    d += 1

  return lmax

print(findLESS([1, 4, 5, 7, 8, 12]))

If range of integers in source data (M) is small, a simple algorithm is possible with O(M²) time and O(M) space:

def findLESS(src):
  r = [False for i in range(src[-1]+1)]
  for x in src:
    r[x] = True

  d = 1
  best = 1

  while best * d < len(r):
    for s in range(d):
      l = 0

      for i in range(s, len(r), d):
        if r[i]:
          l += 1
          best = max(best, l)
        else:
          l = 0

    d += 1

  return best


print(findLESS([1, 4, 5, 7, 8, 12]))

It is similar to the first method by Armin Rigo, but it doesn't use any dynamic data structures. I suppose source data has no duplicates. And (to keep the code simple) I also suppose that minimum input value is non-negative and close to zero.

Previous algorithm may be improved if instead of the array of booleans we use a bitset data structure and bitwise operations to process data in parallel. The code shown below implements bitset as a built-in Python integer. It has the same assumptions: no duplicates, minimum input value is non-negative and close to zero. Time complexity is O(M² * log L) where L is the length of optimal subsequence, space complexity is O(M):

def findLESS(src):
  r = 0
  for x in src:
    r |= 1 << x

  d = 1
  best = 1

  while best * d < src[-1] + 1:
    c = best
    rr = r

    while c & (c-1):
      cc = c & -c
      rr &= rr >> (cc * d)
      c &= c-1

    while c != 1:
      c = c >> 1
      rr &= rr >> (c * d)

    rr &= rr >> d

    while rr:
      rr &= rr >> d
      best += 1

    d += 1

  return best

Benchmarks:

Input data (about 100000 integers) is generated this way:

random.seed(42)
s = sorted(list(set([random.randint(0,200000) for r in xrange(140000)])))

And for fastest algorithms I also used the following data (about 1000000 integers):

s = sorted(list(set([random.randint(0,2000000) for r in xrange(1400000)])))

All results show time in seconds:

Size:                         100000   1000000
Second answer by Armin Rigo:     634         ?
By Armin Rigo, optimized:         64     >5000
O(M^2) algorithm:                 53      2940
O(M^2*L) algorithm:                7       711

python - Longest equally-spaced subsequence

10 Answers

Idea

Pseudocode

Thoughts about memory usage