I'm having a heck of a time trying to come up with a constant time rotate that does not violate the C/C++ standards.
The problem is the edge/corner cases, where operations are called out in algorithms and those algorithms cannot be changed. For example, the following is from Crypto++ and executes the test harness under GCC ubsan (i.e., g++ fsanitize=undefined
):
$ ./cryptest.exe v | grep runtime
misc.h:637:22: runtime error: shift exponent 32 is too large for 32-bit type 'unsigned int'
misc.h:643:22: runtime error: shift exponent 32 is too large for 32-bit type 'unsigned int'
misc.h:625:22: runtime error: shift exponent 32 is too large for 32-bit type 'unsigned int'
misc.h:637:22: runtime error: shift exponent 32 is too large for 32-bit type 'unsigned int'
misc.h:643:22: runtime error: shift exponent 32 is too large for 32-bit type 'unsigned int'
misc.h:637:22: runtime error: shift exponent 32 is too large for 32-bit type 'unsigned int'
And the code at misc.h:637
:
template <class T> inline T rotlMod(T x, unsigned int y)
{
y %= sizeof(T)*8;
return T((x<<y) | (x>>(sizeof(T)*8-y)));
}
Intel's ICC was particularly ruthless, and it removed the entire function call with out the y %= sizeof(T)*8
. We fixed that a few years back, but left the other errata in-place due to lack of a constant time solution.
There's one pain point remaining. When y = 0
, I get a condition where 32 - y = 32
, and it sets up the undefined behavior. If I add a check for if(y == 0) ...
, then the code fails to meet the constant time requirement.
I've looked at a number of other implementations, from the Linux kernel to other cryptographic libraries. They all contain the same undefined behavior, so it appears to be a dead end.
How can I perform the rotate in nearly constant time with a minimum number of instructions?
EDIT: by near constant time, I mean avoid the branch so the same instructions are always executed. I'm not worried about CPU microcode timings. While branch prediction may be great on x86/x64, it may not perform as well on other platforms, like embedded.
None of these tricks would be required if GCC or Clang provided an intrinsic to perform the rotate in near constant time. I'd even settle for "perform the rotate" since they don't even have that.
r
constraint for the value and then referring to the value by%0
or similar. – Matti Virkkunen