AFAICT, the intrinsic leaves garbage in index when the input is zero, weaker than the behaviour of the asm instruction. This is why it has a separate boolean return value and integer output operand.

Despite the index arg being taken by reference, the compiler treats it as output-only.
unsigned char _BitScanReverse64 (unsigned __int32* index, unsigned __int64 mask)
Intel's intrinsics guide documentation for the same intrinsic seems clearer than the Microsoft docs you linked, and sheds some light on what the MS docs are trying to say. But on careful reading, they do both seem to say the same thing, and describe a thin wrapper around the bsr instruction.

Intel documents the BSR instruction as producing an "undefined value" when the input is 0, but setting ZF in that case. But AMD documents it as leaving the destination unchanged:
AMD's BSF entry in AMD64 Architecture Programmer's Manual Volume 3: General-Purpose and System Instructions:

... If the second operand contains 0, the instruction sets ZF to 1 and does not change the contents of the destination register. ...
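As a concrete reference, the AMD-documented semantics quoted above could be modelled in portable C++ like this (bsr64_amd_model is a made-up helper name, not the intrinsic; the return value stands in for ZF):

```cpp
#include <cstdint>

// Model of AMD-documented BSR semantics: if src == 0, "set ZF"
// (return true) and leave *dst untouched; otherwise write the index
// of the highest set bit to *dst and return false (ZF clear).
bool bsr64_amd_model(uint64_t* dst, uint64_t src) {
    if (src == 0)
        return true;        // ZF=1, destination unchanged
    uint64_t i = 63;
    while (!(src >> i))     // scan down to the highest set bit
        --i;
    *dst = i;
    return false;           // ZF=0
}
```

Note that this models only what AMD documents; Intel's manuals promise nothing about the destination for a zero input, even though current Intel hardware behaves the same way.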
On current Intel hardware, the actual behaviour matches AMD's documentation: it leaves the destination register unmodified when the src operand is 0. Perhaps this is why MS describes it as only setting Index when the input is non-zero (and the intrinsic's return value is non-zero).
On Intel (but maybe not AMD), this goes as far as not even truncating a 64-bit register to 32-bit. e.g. mov rax,-1 ; bsf eax, ecx (with zeroed ECX) leaves RAX=-1 (64-bit), not the 0x00000000ffffffff you'd get from xor eax, 0. But with non-zero ECX, bsf eax, ecx has the usual effect of zero-extending into RAX, leaving for example RAX=3.
IDK why Intel still hasn't documented it. Perhaps a really old x86 CPU (like original 386?) implements it differently? Intel and AMD frequently go above and beyond what's documented in the x86 manuals in order to not break existing widely-used code (e.g. Windows), which might be how this started.
At this point it seems unlikely that Intel will ever drop that output dependency and leave actual garbage or -1 or 32 for input=0, but the lack of documentation leaves that option open.
Skylake dropped the false dependency for lzcnt and tzcnt (and a later uarch dropped the false dep for popcnt) while still preserving the dependency for bsr/bsf. (Why does breaking the "output dependency" of LZCNT matter?)
Of course, since MSVC optimized away your index = 0 initialization, presumably it just uses whatever destination register it wants, not necessarily the register that held the previous value of the C variable. So even if you wanted to, I don't think you could take advantage of the dst-unmodified behaviour even though it's guaranteed on AMD.
So in C++ terms, the intrinsic has no input dependency on index. But in asm, the instruction does have an input dependency on the dst register, like an add dst, src instruction. This can cause unexpected performance issues if compilers aren't careful.
Unfortunately on Intel hardware, the popcnt / lzcnt / tzcnt asm instructions also have a false dependency on their destination, even though the result never depends on it. Compilers work around this now that it's known, though, so you don't have to worry about it when using intrinsics (unless you have a compiler more than a couple years old, since it was only recently discovered).
You need to check the return value to make sure index is valid, unless you know the input was non-zero. e.g.
    if (_BitScanReverse64(&idx, input)) {
        // idx is valid.
        // (MS docs say "Index was set")
    } else {
        // input was zero, idx holds garbage.
        // (MS docs don't say Index was even set)
        idx = -1;  // might make sense: one lower than the result for bsr(1)
    }
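If a -1 sentinel is acceptable, you could fold the check into a small total-function wrapper. A sketch (bit_scan_reverse64 is a made-up name; the non-MSVC path uses GCC/Clang's __builtin_clzll, which is likewise undefined for a zero input, hence the same guard):

```cpp
#include <cstdint>

#ifdef _MSC_VER
#include <intrin.h>
#endif

// Sketch: index of the highest set bit, or -1 when x == 0.
int bit_scan_reverse64(uint64_t x) {
#ifdef _MSC_VER
    unsigned long idx;
    return _BitScanReverse64(&idx, x) ? (int)idx : -1;
#else
    // __builtin_clzll(0) is undefined, so check first.
    return x ? 63 - __builtin_clzll(x) : -1;
#endif
}
```

With a predictable branch (or if the compiler turns it into cmov), the extra check is usually cheap; the intrinsic itself still maps to a single bsr when x is known non-zero.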
If you want to avoid this extra check branch, you can use the lzcnt instruction via different intrinsics if you're targeting new enough hardware (e.g. Intel Haswell or AMD Bulldozer, IIRC). It "works" even when the input is all-zero, and actually counts leading zeros instead of returning the index of the highest set bit.
assert fails, right? It's been a while since I've used VS, but shouldn't the assert be disabled on release builds? – eran

index will be set if no bit is set; it only says index will be set with a value if a bit is set. So, if no bit is set, _BitScanReverse64 could just be leaving index alone and it's left with whatever it originally had. – Altainia

_BitScanReverse64. If it's zero, the value of index is undefined (at least in the docs). Assuming so because only a non-zero return indicates index was set. – eran

index is initialized to 0, but _BitScanReverse64 might be changing it internally. The unoptimized version is possibly setting index to zero if no 1 is found, but the optimized version omits that part to save time. Both are in line with the docs ("Nonzero if Index was set, or 0 if no set bits were found.") and the optimized version does less hence is faster. – eran