During the last days I observed a behaviour of my new workstation I couldn't explain. Doing some research on this problem, there might be a possible bug in the INTEL Haswell architecture as well as in the current Skylake Generation.
Before writing about the possible bug, let me give you an overview of the hardware used, the program code and the problem itself.
Workstation hardware specification
- INTEL Xeon E5-2680 V3 2500MHz 30M Cache 12Core
- Supermicro SC745 BTQ -R1K28B-SQ
- 4 x 32GB ECC Registered DDR4-2133 Ram
- INTEL SSD 730 Series 480 GB
- NVIDIA Tesla C2075
- NVIDIA TITAN
Operating system & program code in question
I'm currently running Ubuntu 15.04 64bit Desktop version, latest updates and kernel stuff installed. Besides using this machine to develop CUDA Kernels and stuff, I recently tested a pure C program.
The program is doing sort of modified ART on quite large input sets of data. So the code executes some FFTs and consumes quite some time to finish calculation. I can't currently post / link to any source
code as this is ongoing research that cannot be published. If you're not familiar with ART, just a simple explanation what it does. ART is a technique used to reconstruct the data received from a computer tomograph machine to get
visible images for diagnosis. So our version of the code reconstructs data sets of sizes like 2048x2048x512. Up until now, nothing too special nor rocket science involved. After some hours of debugging and fixing errors, the code was tested
on reference results and we can confirm the code works as it is supposed to. The only library the code is using is standard math.h
. No special compile parameters, no additional library stuff that might bring in additional problems.
Observing the problem
The code implements ART using a technique to minimize the projections needed for reconstructing the data. So let's assume we can reconstruct one slice of data involving 25 projections. The code is started with exactly the same input data on 12 cores. Please note that the implementation is not based on multithreading, currently 12 instances of the program are launched. I know this isn't the best way to do it, involving proper thread management is heavily advised and this is already on the list of improvements :)
So when we run at least two instances of the program (every instance working on a separate data slice), the results are of some projections are wrong in a random fashion. To give you an idea of the results, please see Table1. Please note that the input data is always the same.
Running only one instance of the code involving one core of the CPU, the results are all correct. Even performing some runs involving one CPU core, the results remain correct. Only involving at least two or more cores generates a result pattern as seen in Table1.
Identifying the problem
Okay this took quite some hours to get an idea of what is actually going wrong. So we went through the whole code, most of those problems begin with a minor implementation mistake. But, well, no (of course we cannot proof the absence of bugs nor guarantee it). To verify our code, we used two different machines:
- (Machine1) Intel Core i5 Quad-Core (Model from late 2009)
- (Machine2) Virtual Machine running on Intel XEON 6core SandyBridge CPU
surprisingly, both Machine1 & Machine2 produce always correct results. Even using all CPU-cores, the results remain correct. Not even one wrong result in over 50 runs on every machine. Code was compiled on every target machine without optimization options or any specific compiler settings. So, reading the news led to the following findings:
- ArsTechnika - Skylake CPU freezes during complex workload
- PcWorld - how to test your PC for the skylake bug
- Intel Community - Simple instruction for freezing a Skylake Processor
So the folks over at Prime95 and the Mersenne Community seem to be the first ones to discover and identify this nasty bug. The referenced postings and news support the suspicion, that the problem only exists under heavy workload. Following my observation, I can confirm this behavior.
The question(s)
- Have you / the community observed this problem on Haswell CPUs as well as on Skylake CPUs?
- As gcc does per default AVX(2) optimization (whenever possible), turning off this optimization would help?
- How can I compile my code and ensure, that any optimization that might be affected by this bug is turned off? So far I read only about a problem using the AVX2 command set in Haswell / Skylake architectures.
Solutions?
Okay I can turn off all AVX2 optimizations. But this slows down my code. Intel might release a BIOS update to mainboard manufactures that would modify the microcode in Intel CPUs. As it seems to be a hardware bug, this might become interesting even by updating the CPUs microcode. I think it might be a valid option, as Intel CPUs use some RISC to CISC translation mechanisms controlled by Microcode.
EDIT: Techreport.com - Errata prompts Intel to disable TSX in Haswell, early Broadwell CPUs Will check the microcode version in my CPU.
EDIT2: As of now (19.01.2016 15:39 CET) Memtest86+ v4.20 is running and testing the memory. As this seems to take quite some time to finish, I'll update the post tomorrow with results.
EDIT3: As of now (21.01.2016 09:35 CET) Memtest86+ finished two runs and passed. Not even one memory error. Updated the microcode of the CPU from revision=0x2d
to revision=0x36
. Currently preparing source code for releasing here. Problem with the wrong results consists. As I'm not the author of the code in question, I have to double check not to post code I'm not allowed to. I'm also using the workstation and maintaining it.
EDIT4: (22.01.2016) (12:15 CET) Here ist the Makefile used to compile the sourcecode:
# VARIABLES ==================================================================
CC = gcc
CFLAGS = --std=c99 -Wall
#LDFLAGS = -lm -lgomp -fast -s -m64
LDFLAGS = -lm
OBJ = ArtReconstruction2Min.o
# RULES AND DEPENDENCIES ====================================================
# linking all object files
all: $(OBJ)
$(CC) -o ART2Min $(OBJ) $(LDFLAGS)
# every o-file depends on the corresonding c-file, -g Option bedeutet Debugging Informationene setzen
%.o: %.c
$(CC) -c -g $< $(CFLAGS)
# MAKE CLEAN =================================================================
clean:
rm -f *.o
rm -f main
and the gcc -v
output:
gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/4.9/lto-wrapper
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 4.9.2-10ubuntu13' --with-bugurl=file:///usr/share/doc/gcc-4.9/README.Bugs --enable-languages=c,c++,java,go,d,fortran,objc,obj-c++ --prefix=/usr --program-suffix=-4.9 --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --with-gxx-include-dir=/usr/include/c++/4.9 --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --enable-gnu-unique-object --disable-vtable-verify --enable-plugin --with-system-zlib --disable-browser-plugin --enable-java-awt=gtk --enable-gtk-cairo --with-java-home=/usr/lib/jvm/java-1.5.0-gcj-4.9-amd64/jre --enable-java-home --with-jvm-root-dir=/usr/lib/jvm/java-1.5.0-gcj-4.9-amd64 --with-jvm-jar-dir=/usr/lib/jvm-exports/java-1.5.0-gcj-4.9-amd64 --with-arch-directory=amd64 --with-ecj-jar=/usr/share/java/eclipse-ecj.jar --enable-objc-gc --enable-multiarch --disable-werror --with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 4.9.2 (Ubuntu 4.9.2-10ubuntu13)