Core2/penryn popham
The core2/penryn does not have a popcnt instruction, so the 1-way
unrolled code uses the usual logic (mask, shift and add) method of
calculating popcount.
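
For reference, here is a minimal C sketch of that logic method for
64-bit limbs. The real routine is assembler that is not shown in these
notes, so the function names popcount_limb and popcount_1way are
hypothetical and this is only an illustration of the technique.

```c
#include <stdint.h>
#include <stddef.h>

/* Popcount of one 64-bit limb: mask-and-add reduction down to
   per-byte counts, then a multiply to sum the eight bytes. */
static uint64_t popcount_limb(uint64_t x)
{
    x -= (x >> 1) & 0x5555555555555555ULL;        /* 2-bit fields, 0..2 */
    x = (x & 0x3333333333333333ULL)
      + ((x >> 2) & 0x3333333333333333ULL);       /* 4-bit fields, 0..4 */
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;   /* bytes, 0..8 */
    return (x * 0x0101010101010101ULL) >> 56;     /* byte sum lands in top byte */
}

/* 1-way loop: one limb per iteration (hypothetical name). */
static uint64_t popcount_1way(const uint64_t *p, size_t n)
{
    uint64_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += popcount_limb(p[i]);
    return total;
}
```

The final multiply is the "mul method" of summing: it adds all eight
byte counts into the top byte in one instruction.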
The 2-way unroll saves some more logic, and the 3-way some more. The
4-way, however, overflows our fast "mul method" of summing and so is
slower.
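
One way to see the overflow is the C sketch below. It assumes the
unrolled code adds the per-byte counts of several limbs and then does a
single multiply for the group; the assembler may merge at an earlier
stage, so treat this as an illustration only. The names byte_counts and
popcount_3way are hypothetical.

```c
#include <stdint.h>
#include <stddef.h>

#define M1   0x5555555555555555ULL
#define M2   0x3333333333333333ULL
#define M4   0x0F0F0F0F0F0F0F0FULL
#define ONES 0x0101010101010101ULL

/* Reduce a limb to eight byte counts (each 0..8) without summing them. */
static uint64_t byte_counts(uint64_t x)
{
    x -= (x >> 1) & M1;
    x = (x & M2) + ((x >> 2) & M2);
    return (x + (x >> 4)) & M4;
}

/* 3-way: three limbs share one multiply (hypothetical name). */
static uint64_t popcount_3way(const uint64_t *p, size_t n)
{
    uint64_t total = 0;
    size_t i = 0;
    for (; i + 3 <= n; i += 3) {
        /* Each byte now holds 0..24, and 8 * 24 = 192 still fits
           in the top byte of the product. */
        uint64_t s = byte_counts(p[i]) + byte_counts(p[i + 1])
                   + byte_counts(p[i + 2]);
        total += (s * ONES) >> 56;
    }
    for (; i < n; i++)                 /* leftover limbs, one at a time */
        total += (byte_counts(p[i]) * ONES) >> 56;
    return total;
}
```

With four limbs per multiply each byte can reach 32, and 8 * 32 = 256
no longer fits in the top byte, so four all-ones limbs would sum to 0
instead of 256. A 4-way loop therefore needs extra work to sum.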
We can also use SSE to double the bandwidth of the logic that needs to
be done, although the mul trick for adding is not as good in SSE form.
From the K8 we can use this:
hamdist_3way runs at 5.9c/w
Using SSE and the psadbw instruction to sum the bytes we can get
popcount_4way runs at 2.75c/w (should be 2.5c/w, possibly just some
priming issues, and maybe down to 2.25c/w with some more scheduling)
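
A rough C sketch of the SSE approach, using SSE2 intrinsics in place of
the assembler (which is not shown in these notes). It assumes 64-bit
limbs and four limbs per iteration, with _mm_sad_epu8 (psadbw) against
zero doing the byte summing. The names byte_counts_sse2 and
popcount_4way_sse2 are hypothetical.

```c
#include <stdint.h>
#include <stddef.h>
#include <emmintrin.h>   /* SSE2; _mm_sad_epu8 compiles to psadbw */

/* Per-byte bit counts (0..8) for all 16 bytes of a vector. */
static __m128i byte_counts_sse2(__m128i x)
{
    const __m128i m1 = _mm_set1_epi8(0x55);
    const __m128i m2 = _mm_set1_epi8(0x33);
    const __m128i m4 = _mm_set1_epi8(0x0F);
    x = _mm_sub_epi64(x, _mm_and_si128(_mm_srli_epi64(x, 1), m1));
    x = _mm_add_epi64(_mm_and_si128(x, m2),
                      _mm_and_si128(_mm_srli_epi64(x, 2), m2));
    return _mm_and_si128(_mm_add_epi64(x, _mm_srli_epi64(x, 4)), m4);
}

/* Four limbs (two vectors) per iteration, one psadbw per iteration. */
static uint64_t popcount_4way_sse2(const uint64_t *p, size_t n)
{
    const __m128i zero = _mm_setzero_si128();
    __m128i total = zero;              /* two 64-bit partial sums */
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128i a = byte_counts_sse2(_mm_loadu_si128((const __m128i *)(p + i)));
        __m128i b = byte_counts_sse2(_mm_loadu_si128((const __m128i *)(p + i + 2)));
        /* bytes hold 0..16; psadbw sums each group of 8 bytes */
        total = _mm_add_epi64(total, _mm_sad_epu8(_mm_add_epi8(a, b), zero));
    }
    for (; i < n; i++) {               /* leftover limbs via a 64-bit load */
        __m128i a = byte_counts_sse2(_mm_loadl_epi64((const __m128i *)(p + i)));
        total = _mm_add_epi64(total, _mm_sad_epu8(a, zero));
    }
    uint64_t lanes[2];
    _mm_storeu_si128((__m128i *)lanes, total);
    return lanes[0] + lanes[1];
}
```

Because psadbw produces a 16-bit sum per 8-byte half, the byte counts
from several vectors can be added before summing without the overflow
that limits the scalar mul method.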
MUST do the hamdist in SSE as well