This is simply a limitation in the micro-architecture of your Intel Haswell CPU and several previous1 CPUs. It has been fixed for tzcnt
and lzcnt
as of Skylake-S (client), but the issue remained for popcnt
until it was fixed in Cannon Lake.
On those micro-architectures the destination operand for tzcnt
, lzcnt
and popcnt
is treated as an input dependency even though, semantically, it is not. Now I doubt this is a really a "bug": if it was simply an unintended behavior/oversight, I expect it would have been fixed in one of the several new micro-architectures that have been released since it was introduced.
Most likely it is a design compromise based on one or both of the following two factors:
The hardware for popcnt
, lzcnt
and tzcnt
is likely all shared with the existing bsf
and bsr
instructions. Now bsf
and bsr
do have a dependency on the previous destination value in practice2 for the special case of all-bits-zero input, since Intel chips leave the destination unmodified in that case. So it is entirely possible that the simplest design for the combined hardware resulted in the other similar instructions executing on the same unit inheriting the same dependency.
The vast majority of x86 two operand ALU instructions have an dependency on the destination operand, since it is used as a source as well. The three affected instructions are somewhat unique in that they are unary operators, but unlike existing unary operators like not
and neg
which have a single operand used as source and destination, they have distinct source and destination operands, making them superficially similar to most 2-input instructions. Perhaps the renamer/scheduler circuitry just doesn't distinguish the special case of these unary-with-two-register-operand versus the vast majority of plain shared source/destination 2-input instructions which don't have this dependency.
In fact, for the case of popcnt
Intel has issued various errata covering the false dependency issue such as HSD146 for Haswell Desktop and SKL029 for Skylake, which reads:
POPCNT Instruction May Take Longer to Execute Than Expected
Problem POPCNT instruction execution with a 32 or 64 bit operand may be
delayed until previous non -dependent instructions have executed.
Implication Software using the POPCNT instruction may experience lower performance than expected.
Workaround None identified
I always found this erratum unusual since it isn't really identifying any type of functional defect or non-conformance to specification which is the case for essentially all the other errata. Intel doesn't really document a specific performance model for the OoO execution engine and there are a ton of other performance "gotchas" that have appeared and disappeared over the years (many with a much larger impact than this very minor issue) that don't get documented in errata. Still, this perhaps provides some evidence that it can be considered a bug. Oddly, the erratum was never extended to include tzcnt
or lzcnt
which had the same issue when they were introduced.
1 Well tzcnt
and lzcnt
only appeared in Haswell, but the problem exists for popcnt
as well which was introduced in Nehalem - but the false dependency problem perhaps only exists for Sandy Bridge or later.
2In practice, although not documented in the ISA docs, since the result for all-zero input was undefined in the Intel manuals. Most or all Intel chips implemented the behavior as leaving the destination register unchanged in this case, however.
AMD does document and guarantee that behaviour for bsf
and bsr
.
(But unfortunately those instructions are slower than tzcnt
/lzcnt
on AMD (extra uops, see https://uops.info/), so instead of taking advantage of that bsf
behaviour, it would often be better for AMD CPUs to use rep bsf
so it will decode as tzcnt
on CPUs that know about that instruction, and test
/cmov
if you have enough free registers. But bsr
gives different results to lzcnt
even for non-zero input, so you might consider taking advantage of it.)
lzcnt
can't be executed speculatively (it updates CF, ZF) afterjnz
(ZF != 0). Whereasxor
breaks the dependency chain? But sinceadd
would retire the previous flags anyway, I'm not if this is the case. – Brett Halenop
instead ofxor eax, eax
made it slow again – haroldLZCNT
should have an input dependency on its output, but it does. ThePOPCNT
instruction has the same bug, as described in detail here. – Cody Gray♦