AMD K10/K10-2
The K10 has the new popcnt instruction, which executes in the single
ABM execution unit, so we should be able to get 1.0c/w. For hamdist,
however, we are also limited by the retirement of macro-ops, to
1.333+ε c/w.
For popcount, macro-op retirement and 2c per loop mean that a 2-way
unroll is the minimum. The pick hardware implies pipelining is needed,
and with it we can get the optimal 1.0c/w popcount.
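A minimal sketch of the 2-way unroll described above, assuming a "word"
is 64 bits and a GCC/Clang compiler (the __builtin_popcountll builtin
only becomes the hardware popcnt instruction when built with -mpopcnt
or a matching -march). The function name popcount_words is hypothetical
and this is not the tuned kernel these notes measure; it only shows the
loop shape with two independent accumulators.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: count the set bits in n 64-bit words.
   The loop body handles two words per iteration (2-way unroll) and
   keeps two accumulators so the two adds do not form one dependency
   chain. */
uint64_t popcount_words(const uint64_t *w, size_t n)
{
    assert(w != NULL || n == 0);

    uint64_t c0 = 0, c1 = 0;
    size_t i = 0;

    for (; i + 2 <= n; i += 2) {
        c0 += (uint64_t)__builtin_popcountll(w[i]);
        c1 += (uint64_t)__builtin_popcountll(w[i + 1]);
    }
    /* Odd word left over when n is not a multiple of 2. */
    if (i < n)
        c0 += (uint64_t)__builtin_popcountll(w[i]);

    return c0 + c1;
}
```

Whether the compiler keeps this shape, and whether it reaches 1.0c/w,
depends on the generated code; the figures in these notes refer to
hand-scheduled loops.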
Strictly, 1.0c/w for popcount is not optimal, as we have some spare
execution slots that we could fill with a traditional K8 (or SSE
Core 2) popcount. This would let us do something like a 30-way unroll
for a 10% speedup, i.e. 0.90c/w.
For hamdist, a 4-way unroll gives us 1.5c/w.
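As an illustration of the 4-way hamdist unroll, here is a hedged sketch
under the same assumptions as before: 64-bit words, GCC/Clang, and
-mpopcnt to get the hardware instruction. The name hamdist_words is
hypothetical. Hamming distance here is the popcount of the XOR of the
two inputs, word by word.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: Hamming distance between two arrays of n
   64-bit words, i.e. the number of bit positions in which they
   differ. Four words are processed per iteration (4-way unroll),
   each with its own accumulator. */
uint64_t hamdist_words(const uint64_t *a, const uint64_t *b, size_t n)
{
    assert((a != NULL && b != NULL) || n == 0);

    uint64_t d0 = 0, d1 = 0, d2 = 0, d3 = 0;
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        d0 += (uint64_t)__builtin_popcountll(a[i]     ^ b[i]);
        d1 += (uint64_t)__builtin_popcountll(a[i + 1] ^ b[i + 1]);
        d2 += (uint64_t)__builtin_popcountll(a[i + 2] ^ b[i + 2]);
        d3 += (uint64_t)__builtin_popcountll(a[i + 3] ^ b[i + 3]);
    }
    /* Remaining 0..3 words. */
    for (; i < n; i++)
        d0 += (uint64_t)__builtin_popcountll(a[i] ^ b[i]);

    return d0 + d1 + d2 + d3;
}
```

The extra XOR and second load per word are why hamdist costs more
macro-ops per word than plain popcount.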