K8 popham

AMD K8 popham

The K8 does not have a popcnt instruction, so the 1-way (non-unrolled) code uses the usual logic method of calculating popcount.
The 2-way unroll saves some of that logic and the 3-way saves some more. The 4-way, however, overflows our fast "mul method" of summing and so is slower.
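
As a rough illustration of the logic method and the "mul method" of summing, here is a minimal C sketch. It is not the assembly itself: the function names are hypothetical, and the explanation of the 4-way overflow (byte counts from four words can total 4*8*8 = 256, which no longer fits in the top byte of the product) is an assumption about what the note means.

```c
#include <stdint.h>
#include <stddef.h>

/* Reduce a 64-bit word to eight per-byte bit counts (each 0..8). */
static uint64_t byte_counts(uint64_t x)
{
    x = x - ((x >> 1) & 0x5555555555555555ULL);                           /* 2-bit sums */
    x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL); /* 4-bit sums */
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;                           /* 8-bit sums */
    return x;
}

/* "mul method": one multiply and one shift add up the eight bytes.
   The result is only correct while the total is at most 255. */
static unsigned sum_bytes_mul(uint64_t b)
{
    return (unsigned)((b * 0x0101010101010101ULL) >> 56);
}

/* Hypothetical 1-way version: one multiply per word. */
unsigned long popcount_1way_sketch(const uint64_t *p, size_t n)
{
    unsigned long total = 0;
    for (size_t i = 0; i < n; i++)
        total += sum_bytes_mul(byte_counts(p[i]));
    return total;
}

/* Hypothetical 3-way version: add the byte counts of three words first
   (each byte at most 24, total at most 192), then do a single multiply. */
unsigned long popcount_3way_sketch(const uint64_t *p, size_t n)
{
    unsigned long total = 0;
    size_t i = 0;
    for (; i + 3 <= n; i += 3) {
        uint64_t b = byte_counts(p[i]) + byte_counts(p[i + 1]) + byte_counts(p[i + 2]);
        total += sum_bytes_mul(b);
    }
    for (; i < n; i++)  /* leftover words, one at a time */
        total += sum_bytes_mul(byte_counts(p[i]));
    return total;
}
```

Under the same assumption, hamdist is the same computation applied to the XOR of the two operands.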

1-way?

popcount_2way runs at 5.5c/w
popcount_3way runs at 4.666c/w

hamdist_2way runs at 5.5c/w
hamdist_3way runs at 5.0c/w

For our given popham algorithm and chosen unrolling, these implementations are optimal.

However, MMX has the instruction psadbw, which can sum 8 bytes in 1 instruction instead of two (imul const,reg , shr 56,reg). Most of the MMX instructions needed will only use two pipes (FMUL and FADD; the FMISC pipe doesn't do much). As we are nowhere near being ld/st bound, a mixed regular/MMX version could cut it, since we can then unroll to any amount.
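
To make the comparison concrete, the sketch below models psadbw against a zero operand in portable C and sets it beside the imul/shr pair. It is a model only, not MMX code, and the function names are hypothetical. It assumes the point of "unroll to any amount" is that the psadbw sum is not limited to 255 the way the top byte of the product is.

```c
#include <stdint.h>
#include <stdlib.h>

/* Portable model of psadbw on one 64-bit register:
   the sum of |a_i - b_i| over the eight unsigned bytes. */
static unsigned psadbw_model(uint64_t a, uint64_t b)
{
    unsigned sum = 0;
    for (int i = 0; i < 8; i++) {
        int x = (int)((a >> (8 * i)) & 0xFF);
        int y = (int)((b >> (8 * i)) & 0xFF);
        sum += (unsigned)abs(x - y);
    }
    return sum;
}

/* The two-instruction alternative: imul by 0x0101...01, then shr 56.
   Only the low 8 bits of the byte total survive. */
static unsigned sum_bytes_mul(uint64_t v)
{
    return (unsigned)((v * 0x0101010101010101ULL) >> 56);
}

/* Byte sum via the psadbw model: with a zero second operand the
   absolute differences are just the bytes themselves. */
static unsigned sum_bytes_sad(uint64_t v)
{
    return psadbw_model(v, 0);
}
```

With per-byte counts accumulated from three words (0x18 in every byte) both methods give 192. With counts from four words (0x20 in every byte) the psadbw model gives 256 while the multiply wraps to 0.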