AMD K8 popham
The K8 does not have a popcnt instruction, so we use the usual logic
method of calculating popcount for the 1-way unrolled code.
The 2-way unroll saves some more logic, and the 3-way some more; the
4-way, however, overflows our fast "mul method" of summing and so is
slower.
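
As a rough illustration of the logic method and the "mul method" of
summing, here is a minimal C sketch. It is not the assembly shipped
here; the names byte_counts, sum_bytes_mul, popcount_words and
hamdist_words are hypothetical, and 64-bit words are assumed. Each word
is reduced to eight byte-wide counts (0..8 each), the byte counts of up
to three words are added, and one multiply and shift adds the eight
bytes into the top byte. With three words each byte is at most 24, so
the top byte holds at most 8*24 = 192; with four words it could reach
8*32 = 256, which no longer fits in a byte.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reduce a 64-bit word to eight byte-wide bit counts (each 0..8). */
static uint64_t byte_counts(uint64_t x)
{
    x = x - ((x >> 1) & 0x5555555555555555ULL);          /* 2-bit sums */
    x = (x & 0x3333333333333333ULL)
      + ((x >> 2) & 0x3333333333333333ULL);              /* 4-bit sums */
    return (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;       /* byte sums  */
}

/* "mul method": the multiply adds all eight bytes into the top byte.
   Only valid while the total is below 256. */
static unsigned sum_bytes_mul(uint64_t bytes)
{
    return (unsigned)((bytes * 0x0101010101010101ULL) >> 56);
}

/* Hypothetical 3-way unrolled popcount over n words. */
unsigned long popcount_words(const uint64_t *p, size_t n)
{
    unsigned long total = 0;
    size_t i = 0;

    assert(p != NULL || n == 0);
    for (; i + 3 <= n; i += 3)      /* three words share one mul/shift */
        total += sum_bytes_mul(byte_counts(p[i]) + byte_counts(p[i + 1])
                               + byte_counts(p[i + 2]));
    for (; i < n; i++)              /* 1-way tail */
        total += sum_bytes_mul(byte_counts(p[i]));
    return total;
}

/* Hamming distance is the popcount of the xor of the two operands. */
unsigned long hamdist_words(const uint64_t *a, const uint64_t *b, size_t n)
{
    unsigned long total = 0;
    size_t i = 0;

    assert((a != NULL && b != NULL) || n == 0);
    for (; i + 3 <= n; i += 3)
        total += sum_bytes_mul(byte_counts(a[i] ^ b[i])
                               + byte_counts(a[i + 1] ^ b[i + 1])
                               + byte_counts(a[i + 2] ^ b[i + 2]));
    for (; i < n; i++)
        total += sum_bytes_mul(byte_counts(a[i] ^ b[i]));
    return total;
}
```

The sketch only shows where the saving from unrolling comes from: the
mul/shift pair is paid once per group of words instead of once per
word.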
1-way?
popcount_2way runs at 5.5c/w
popcount_3way runs at 4.666c/w
hamdist_2way runs at 5.5c/w
hamdist_3way runs at 5.0c/w
For our given popham algorithm and chosen unrolling, these
implementations are optimal.
However, using MMX, which has the psadbw instruction, we can sum 8
bytes in one instruction instead of two (imul const,reg ; shr 56,reg).
Most of the instructions needed would only use two pipes (FMUL and
FADD; the FMISC pipe doesn't do much), but as we are nowhere near being
ld/st bound, mixed regular/MMX code could cut it, as we can unroll to
any amount.
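
To show why psadbw lifts the limit that stops the "mul method" at
3-way, here is a small portable C model. It is only a sketch: it does
not use MMX, and the names sad_with_zero, popcount_4way_sad and
popcount_4way_mul are hypothetical. psadbw against an all-zero operand
yields the sum of the eight unsigned bytes as a 16-bit value, so the
total is not confined to one byte; the loop below stands in for that
single instruction.

```c
#include <stdint.h>

/* Reduce a 64-bit word to eight byte-wide bit counts (each 0..8). */
static uint64_t byte_counts(uint64_t x)
{
    x = x - ((x >> 1) & 0x5555555555555555ULL);
    x = (x & 0x3333333333333333ULL)
      + ((x >> 2) & 0x3333333333333333ULL);
    return (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;
}

/* Model of psadbw with a zero second operand: sum of the eight
   unsigned bytes, at most 8*255 = 2040. */
static unsigned sad_with_zero(uint64_t bytes)
{
    unsigned sum = 0;
    for (int i = 0; i < 8; i++)
        sum += (unsigned)((bytes >> (8 * i)) & 0xFF);
    return sum;
}

/* 4-way sum with the psadbw model: each byte is at most 32, and the
   total of up to 256 is returned correctly. */
unsigned popcount_4way_sad(uint64_t a, uint64_t b, uint64_t c, uint64_t d)
{
    return sad_with_zero(byte_counts(a) + byte_counts(b)
                         + byte_counts(c) + byte_counts(d));
}

/* Same 4-way sum with the mul method, for comparison only: the total
   is taken modulo 256, so four all-ones words wrap around to 0. */
unsigned popcount_4way_mul(uint64_t a, uint64_t b, uint64_t c, uint64_t d)
{
    uint64_t s = byte_counts(a) + byte_counts(b)
               + byte_counts(c) + byte_counts(d);
    return (unsigned)((s * 0x0101010101010101ULL) >> 56);
}
```

In this model the per-byte counts themselves must still stay below 256
before the byte sum is taken.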