Core2/penryn popham

The Core2/Penryn does not have a popcnt instruction, so the 1-way unrolled code uses the usual logic (bit-twiddling) method of calculating popcount.
The 2-way unroll saves some logic and the 3-way saves some more; the 4-way, however, overflows our fast "mul method" of summing and so is slower.
We can also use SSE to double the bandwidth of the logic operations, although the mul trick for summing is not as good in SSE form.
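
As a rough illustration of the logic method and the mul trick, here is a minimal C sketch (not the actual assembler code; the function names are made up for this example). It assumes the unrolled versions merge the per-byte counts of several words before one multiply. Under that assumption three words give at most 24 per byte and 8*24 = 192, which still fits in the top byte of the product, while four words give 8*32 = 256, which does not, and that would explain the 4-way overflow.

```c
#include <stdint.h>
#include <stddef.h>

/* Reduce a 64-bit word to eight per-byte bit counts (each 0..8). */
static uint64_t byte_counts(uint64_t x)
{
    x = x - ((x >> 1) & 0x5555555555555555ULL);                           /* 2-bit sums */
    x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL); /* 4-bit sums */
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;                           /* 8-bit sums */
    return x;
}

/* "mul method": one multiply adds all eight bytes into the top byte.
   Only valid while the total of the eight bytes is at most 255. */
static unsigned sum_bytes_mul(uint64_t b)
{
    return (unsigned)((b * 0x0101010101010101ULL) >> 56);
}

/* 1-way: one multiply per word. */
unsigned long popcount_1way(const uint64_t *p, size_t n)
{
    unsigned long total = 0;
    for (size_t i = 0; i < n; i++)
        total += sum_bytes_mul(byte_counts(p[i]));
    return total;
}

/* 3-way: add the byte counts of three words (each byte <= 24,
   total <= 192) and pay for only one multiply per three words. */
unsigned long popcount_3way(const uint64_t *p, size_t n)
{
    unsigned long total = 0;
    size_t i = 0;
    for (; i + 3 <= n; i += 3) {
        uint64_t b = byte_counts(p[i]) + byte_counts(p[i + 1])
                   + byte_counts(p[i + 2]);
        total += sum_bytes_mul(b);
    }
    for (; i < n; i++)                 /* leftover words, 1-way */
        total += sum_bytes_mul(byte_counts(p[i]));
    return total;
}
```

The real code may well merge the words at an earlier stage (2-bit or 4-bit sums) to save more logic; the sketch merges at the byte stage only to keep the overflow bound easy to see.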


From the K8 code we can reuse:
 
hamdist_3way runs at 5.9c/w

and using SSE, with the psadbw instruction to sum the bytes, we get

popcount_4way runs at 2.75c/w (should be 2.5c/w, possibly just some priming issues, and maybe 2.25c/w with some more scheduling)
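
For reference, the SSE idea can be sketched in C with SSE2 intrinsics roughly as follows. This is only an illustration under assumptions: the name popcount_sse2 is invented here, the loop handles two words per iteration rather than being a true 4-way unroll, and it says nothing about the scheduling that the timings above depend on. _mm_sad_epu8 is the intrinsic for psadbw; against a zero vector it adds up the eight bytes in each 64-bit half.

```c
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdint.h>
#include <stddef.h>

unsigned long popcount_sse2(const uint64_t *p, size_t n)
{
    const __m128i m1   = _mm_set1_epi8(0x55);
    const __m128i m2   = _mm_set1_epi8(0x33);
    const __m128i m4   = _mm_set1_epi8(0x0F);
    const __m128i zero = _mm_setzero_si128();
    __m128i acc = zero;              /* two 64-bit running totals */
    size_t i = 0;

    for (; i + 2 <= n; i += 2) {
        /* two 64-bit words per 128-bit register */
        __m128i x = _mm_loadu_si128((const __m128i *)(p + i));
        /* same logic steps as the scalar method, down to per-byte counts */
        x = _mm_sub_epi8(x, _mm_and_si128(_mm_srli_epi64(x, 1), m1));
        x = _mm_add_epi8(_mm_and_si128(x, m2),
                         _mm_and_si128(_mm_srli_epi64(x, 2), m2));
        x = _mm_and_si128(_mm_add_epi8(x, _mm_srli_epi64(x, 4)), m4);
        /* psadbw against zero sums the bytes of each 64-bit half */
        acc = _mm_add_epi64(acc, _mm_sad_epu8(x, zero));
    }

    uint64_t lanes[2];
    _mm_storeu_si128((__m128i *)lanes, acc);
    unsigned long total = (unsigned long)(lanes[0] + lanes[1]);

    if (i < n) {                     /* odd leftover word, simple scalar loop */
        uint64_t w = p[i];
        while (w) {
            w &= w - 1;              /* clear the lowest set bit */
            total++;
        }
    }
    return total;
}
```

Because psadbw leaves a 16-bit sum in each half, the byte counts of several vectors could be added together before the psadbw step without the overflow limit that the mul method hits.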


MUST do hamdist in SSE as well.