AMD K10/K10-2

The K10 has the new popcnt instruction, which executes in the single ABM execution unit, so we should be able to get 1.0c/w. For hamdist, however, we are also limited by the retirement of macro-ops, to 1.333+ε c/w.
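For reference, a minimal C sketch of what the two routines compute, one popcnt per word. The function names and the 64-bit word type are assumptions for illustration, not the names used by the actual assembler code, and __builtin_popcountll is a GCC/Clang builtin that only becomes a real popcnt instruction when the compiler is told the target supports it (for example -mpopcnt).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference routines; names and word type are assumptions. */

/* Total number of set bits in p[0..n-1]. */
uint64_t popcount_ref(const uint64_t *p, size_t n)
{
    uint64_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += (uint64_t)__builtin_popcountll(p[i]);
    return total;
}

/* Hamming distance: number of bit positions where a and b differ. */
uint64_t hamdist_ref(const uint64_t *a, const uint64_t *b, size_t n)
{
    uint64_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += (uint64_t)__builtin_popcountll(a[i] ^ b[i]);
    return total;
}
```

hamdist does the same work as popcount plus a second load and an xor per word, which is why it runs into the macro-op retirement limit first.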

For popcount, macro-op retirement and the 2c per loop limit say that a 2-way unroll is the minimum. The pick hardware implies that pipelining is needed, and with it we can get the optimal 1.0c/w popcount.
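The shape of a 2-way unrolled, software-pipelined popcount loop can be sketched in C as follows. This only illustrates the structure (loads for the next iteration issued ahead of the popcnt of the current one); the function name is hypothetical, and a C compiler is free to reschedule the loop, so the 1.0c/w figure applies to hand-scheduled assembler, not to this sketch.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical name; structure only, not the actual assembler routine. */
uint64_t popcount_unroll2(const uint64_t *p, size_t n)
{
    uint64_t s0 = 0, s1 = 0;   /* two independent accumulators */
    size_t i = 0;

    if (n >= 2) {
        /* Prologue: load the first pair before entering the loop. */
        uint64_t w0 = p[0], w1 = p[1];

        for (i = 2; i + 2 <= n; i += 2) {
            /* Load the next pair while counting the current pair. */
            uint64_t n0 = p[i], n1 = p[i + 1];
            s0 += (uint64_t)__builtin_popcountll(w0);
            s1 += (uint64_t)__builtin_popcountll(w1);
            w0 = n0;
            w1 = n1;
        }

        /* Epilogue: count the last pair that was loaded. */
        s0 += (uint64_t)__builtin_popcountll(w0);
        s1 += (uint64_t)__builtin_popcountll(w1);
    }

    /* At most one leftover word when n is odd (or n == 1). */
    for (; i < n; i++)
        s0 += (uint64_t)__builtin_popcountll(p[i]);

    return s0 + s1;
}
```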

Strictly, 1.0c/w for popcount is not optimal, as we have some spare execution slots which we could fill with a traditional K8 (or SSE Core 2) popcount. This would let us do something like a 30-way unroll for a 10% speedup, i.e. 0.90c/w.
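One possible reading of the 0.90c/w figure (an assumption, the split is not stated above): if 27 of the 30 words go through popcnt at one per cycle and the other 3 are counted by ordinary integer instructions in the spare slots, then 30 words take 27 cycles, i.e. 27/30 = 0.90c/w. As a stand-in for the "traditional" popcount, here is the usual mask-and-add bit-twiddling method in C; the function name is hypothetical and this is not claimed to be the exact K8 routine.

```c
#include <stdint.h>

/* Hypothetical name; standard mask-and-add popcount of one 64-bit word,
   using only shifts, ands, adds and one multiply (no popcnt instruction). */
unsigned swar_popcount64(uint64_t x)
{
    /* 2-bit fields each hold the count of their two bits. */
    x = x - ((x >> 1) & 0x5555555555555555ULL);
    /* 4-bit fields each hold the count of their four bits. */
    x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
    /* Bytes each hold the count of their eight bits. */
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;
    /* Sum all eight byte counts into the top byte. */
    return (unsigned)((x * 0x0101010101010101ULL) >> 56);
}
```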

For hamdist, a 4-way unroll gives us 1.5c/w.
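A 4-way unrolled hamdist loop has roughly the following shape in C. The function name is hypothetical and the sketch shows only the unrolling with independent accumulators; the 1.5c/w figure refers to the hand-written assembler, and a compiler may schedule this differently.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical name; structure only, not the actual assembler routine. */
uint64_t hamdist_unroll4(const uint64_t *a, const uint64_t *b, size_t n)
{
    uint64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;  /* independent accumulators */
    size_t i = 0;

    /* Main loop: four words per iteration. */
    for (; i + 4 <= n; i += 4) {
        s0 += (uint64_t)__builtin_popcountll(a[i]     ^ b[i]);
        s1 += (uint64_t)__builtin_popcountll(a[i + 1] ^ b[i + 1]);
        s2 += (uint64_t)__builtin_popcountll(a[i + 2] ^ b[i + 2]);
        s3 += (uint64_t)__builtin_popcountll(a[i + 3] ^ b[i + 3]);
    }

    /* Tail: zero to three leftover words. */
    for (; i < n; i++)
        s0 += (uint64_t)__builtin_popcountll(a[i] ^ b[i]);

    return s0 + s1 + s2 + s3;
}
```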