
I am writing a routine that calls NVIDIA's sum reduction kernel (reduction6, from the CUDA SDK reduction sample), but when I compare the results between the CPU and the GPU, I get an error that grows as the vector size increases:

Both the CPU and GPU reductions use single-precision floats:

Size: 1024  (Blocks : 1,  Threads : 512)
Reduction on CPU:  508.1255188 
Reduction on GPU:  508.1254883 
Error:  6.0059137e-06

Size: 16384 (Blocks : 8, Threads : 1024)
Reduction on CPU:  4971.3193359 
Reduction on GPU:  4971.3217773 
Error:  4.9109825e-05

Size: 131072 (Blocks : 64, Threads : 1024)
Reduction on CPU:  49986.6718750 
Reduction on GPU:  49986.8203125 
Error:  2.9695415e-04

Size: 1048576 (Blocks : 512, Threads : 1024)
Reduction on CPU:  500003.7500000 
Reduction on GPU:  500006.8125000 
Error:  6.1249541e-04

Any idea about this error? Thanks.

I don't see errors, I see differences, and probably completely normal and expected differences. Write a double-precision sum reduction and run it on the CPU, then compare the CPU and GPU single-precision results to the double-precision result (a sketch follows these comments). You will probably be surprised by the results. - talonmies
@talonmies Shouldn't the results be the same for double precision? Or does it depend on the GPU architecture? - pQB
@pQB: Perhaps you misunderstood. What I mean is that both single precision solutions (GPU and CPU) should be compared with a double or even quad precision reference solution, rather than with each other. - talonmies
It's not wise to assume or expect that two different floating-point machines will produce identical results. Even when both advertise adherence to a standard such as IEEE 754, there are a variety of settings that may differ between the two, as well as two different compilers selecting and ordering the actual machine-level instructions. This is a long paper, but it highlights many of the topics involved in understanding a floating-point implementation in a language. In particular, read the conclusion. - Robert Crovella
OK @talonmies, but what is the reason for this difference? Shouldn't the results be the same if both routines use the same precision? Thanks. - user2093311
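
A minimal sketch of the comparison talonmies suggests in the comments above, using random placeholder data (the gpu_sum value here is a stand-in for whatever your kernel actually returns):

/* Accumulate a double-precision reference on the CPU and measure both
   single-precision sums against it, instead of against each other. */
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    enum { N = 1048576 };
    static float x[N];
    for (int i = 0; i < N; ++i)
        x[i] = (float)rand() / (float)RAND_MAX;  /* placeholder data */

    float  cpu_sum = 0.0f;  /* sequential single-precision CPU sum */
    double ref_sum = 0.0;   /* double-precision reference */
    for (int i = 0; i < N; ++i) {
        cpu_sum += x[i];
        ref_sum += (double)x[i];
    }

    float gpu_sum = cpu_sum;  /* stand-in: substitute your kernel's result */
    printf("CPU vs. reference: %e\n", fabs((double)cpu_sum - ref_sum));
    printf("GPU vs. reference: %e\n", fabs((double)gpu_sum - ref_sum));
    return 0;
}

Typically the tree-based GPU sum ends up at least as close to the double-precision reference as the sequential CPU sum does.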

1 Answer


Floating point addition is not necessarily associative.

This means that when you change the order of operations in your floating-point summation, you may get different results. Parallelizing a summation, by definition, changes the order in which the operations are performed.
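
A two-line illustration of the non-associativity; the values are chosen so that 1.0f falls below the rounding resolution of 1.0e8f:

#include <stdio.h>

int main(void) {
    float a = 1.0e8f, b = -1.0e8f, c = 1.0f;
    /* At magnitude 1e8 the spacing between adjacent floats is 8,
       so adding 1.0f to -1.0e8f is lost to rounding. */
    printf("(a + b) + c = %f\n", (a + b) + c);  /* prints 1.000000 */
    printf("a + (b + c) = %f\n", a + (b + c));  /* prints 0.000000 */
    return 0;
}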

There are many ways to sum floating-point numbers, and each has accuracy benefits for different input distributions. Here's a decent survey.
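
One well-known scheme from that family is Kahan (compensated) summation; a minimal sketch:

/* Kahan summation: a running compensation term feeds the low-order
   bits lost in each addition back into the next one. */
float kahan_sum(const float *x, int n) {
    float sum  = 0.0f;
    float comp = 0.0f;              /* accumulated rounding error */
    for (int i = 0; i < n; ++i) {
        float y = x[i] - comp;      /* apply the stored correction */
        float t = sum + y;          /* low-order bits of y are lost here */
        comp = (t - sum) - y;       /* algebraically zero; numerically, the loss */
        sum = t;
    }
    return sum;
}

Note that aggressive compiler options such as -ffast-math may reassociate these operations and defeat the compensation.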

Sequential summation in the given order is rarely the most accurate way to sum, so if that is what you are comparing against, don't expect it to compare well with the tree-based summation used in a typical parallel reduction.
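
If you want a CPU reference whose ordering resembles the GPU's, a recursive pairwise (tree) sum is a closer match; a sketch:

/* Pairwise summation: split the range in half and add the halves' sums,
   a tree-shaped combining order similar to a parallel reduction's.
   Worst-case rounding error grows like O(log n) rather than O(n). */
float pairwise_sum(const float *x, int n) {
    if (n == 1)
        return x[0];
    int half = n / 2;
    return pairwise_sum(x, half) + pairwise_sum(x + half, n - half);
}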