4
votes

I came across a segmented implementation of the Sieve of Eratosthenes which promises to run many times faster than the conventional version. Can someone please explain how segmentation improves the running time? Note that I want to find the prime numbers in [1, b].

It works on this idea (for finding the prime numbers up to 10^9):

  • We first generate the sieving primes below sqrt(10^9), which are needed to cross off multiples. Then we start crossing off the multiples of the first prime, 2, until we reach a multiple of 2 >= segment_size. When this happens, we calculate that multiple's index in the next segment using (multiple - segment_size) and store it in a separate array (next[]). We then cross off the multiples of the remaining sieving primes using the same procedure. Once we have crossed off the multiples of all sieving primes in the first segment, we iterate over the sieve array and print out (or count) the primes.

  • In order to sieve the next segment, we reset the sieve array and increment the lower offset by segment_size. Then we start crossing off multiples again: for each sieving prime we retrieve the sieve index from the next array and resume crossing off multiples from there.

2
The point is to sieve one segment (e.g. 32 KB) at a time, rather than all the numbers at once. This works much better with the cache. – T.C.

2 Answers

9
votes

A segmented sieve does all the same operations as a regular sieve, so the big-O time complexity is unchanged. The difference is in the use of memory. If the sieve is small enough to fit in memory, there is no difference. As the size of the sieve increases, locality of reference becomes a factor, so the sieving process slows down. In the extreme case, if the sieve doesn't fit in memory and has to be paged to disk, the sieving process will become very slow. A segmented sieve keeps the memory size constant, and presumably small, so all accesses of the sieve are local, thus fast.

6
votes

Even if the sieve fits completely into RAM, locality of access still makes a big difference. A C++ implementation of the odds-only Eratosthenes takes almost half a minute to sieve the first 2^32 numbers; the same implementation initialising the same sieve in small, 256 KByte segments (2^21 bits which represent 2^22 numbers) takes only 8.5 seconds on my aging Nehalem with its 256 KByte L2 cache.

The speed-up from sieving in small cache-friendly segments diminishes in the higher regions of the range, since the sieve has to iterate over all factors up to sqrt(n) every time, regardless of how small or big the segment is. This is most noticeable for segments close to 2^64, where the small factors comprise all 203,280,221 primes below 2^32 (i.e. a full 32-bit sieve).

Segmented operation still beats full sieving, though. You can sieve segments close to 2^64 to the tune of millions of primes per second, tens of millions per second in lower regions. That's counting the primes, not the raw number ore. Full sieves are just not practical beyond 2^32 or so, even if you have oodles of memory.