You have a memory race in your code. This:
sum += 2.0 * 3.0;
potentially allows multiple threads to accumulate into sum simultaneously. In your example, both threads attempt to load from and store to the same address at the same time, which is undefined behaviour in CUDA.
The usual way to avoid this problem is algorithm redesign: just don't have multiple threads writing to the same memory location. There is a very widely described shared memory reduction technique you can use to accumulate sums from shared memory arrays without memory races.
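A minimal sketch of that reduction technique, assuming the block size is a power of two (the kernel name and arguments are illustrative, not from your code):

```cuda
__global__ void block_sum(const double *in, double *out, int n)
{
    extern __shared__ double sdata[];

    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + threadIdx.x;

    // Each thread loads one element into shared memory (0.0 if out of range)
    sdata[tid] = (i < n) ? in[i] : 0.0;
    __syncthreads();

    // Tree reduction: halve the number of active threads each step.
    // No two threads ever write the same shared memory location in
    // the same step, so there is no race.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    // Thread 0 writes this block's partial sum to global memory
    if (tid == 0)
        out[blockIdx.x] = sdata[0];
}
```

Each block produces one partial sum; you then reduce the per-block results with a second kernel launch or on the host.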
Alternatively, there are atomic memory access primitives which can be used to serialise memory access. Your example uses double precision floating point, and I am fairly sure there is no intrinsic atomic add function for that type. The programming guide includes an example of a user-space atomic add for double precision. Depending on your hardware, it may or may not be usable on a shared memory variable: 64 bit shared memory atomic operations are only supported on compute capability 2.x and 3.x devices. In any case, atomic memory operations should be used sparingly, because serialising memory access greatly reduces performance.
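For reference, the programming guide's double precision atomic add is built from a 64 bit atomicCAS retry loop, roughly like this (I've named it atomicAdd_double here to avoid any clash with a built-in):

```cuda
// User-space double precision atomic add, essentially as shown
// in the CUDA C Programming Guide.
__device__ double atomicAdd_double(double *address, double val)
{
    unsigned long long int *address_as_ull =
        (unsigned long long int *)address;
    unsigned long long int old = *address_as_ull, assumed;

    do {
        assumed = old;
        // Attempt to swap in the updated value; atomicCAS returns the
        // value that was actually at the address.
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(val +
                            __longlong_as_double(assumed)));
        // Loop again if another thread modified the value in between.
    } while (assumed != old);

    return __longlong_as_double(old);
}
```

Every contending thread spins until its compare-and-swap succeeds, which is exactly the serialisation cost mentioned above.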