Nehalem/Westmere CPUs have the new popcnt instruction, which has a
throughput of 1 per cycle. We should therefore be able to reach 1.0c/w
(cycles per word) for popcount and 2.0c/w for hamdist, which needs two
loads per word and is therefore bound by ld/st.
  popcount, which runs at 1.0c/w
  hamdist, which runs at 2.0c/w
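
The two loops can be sketched in plain C as follows. This is a minimal
illustration, not the tuned code the figures above refer to: the function
names popcount_words and hamdist_words are hypothetical, and
__builtin_popcountll is the GCC/Clang builtin, which only compiles to a
single popcnt instruction when the target enables it (for example with
-mpopcnt or -msse4.2).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: count the set bits in n 64-bit words.
 * One load and one popcnt per word. */
uint64_t popcount_words(const uint64_t *a, size_t n)
{
    uint64_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += (uint64_t)__builtin_popcountll(a[i]);
    return total;
}

/* Hypothetical helper: Hamming distance between two n-word bit vectors.
 * Two loads (a[i] and b[i]), one xor and one popcnt per word. */
uint64_t hamdist_words(const uint64_t *a, const uint64_t *b, size_t n)
{
    uint64_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += (uint64_t)__builtin_popcountll(a[i] ^ b[i]);
    return total;
}
```

The only difference between the two loops is the second load, which is
where the extra cycle per word in hamdist comes from.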
Using mixed int/SSE code, it should be possible to break the ld/st bound
in hamdist and improve on the time of 2.0c/w.
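
One way the mixed int/SSE idea might look is sketched below. It is an
assumption about the approach, not a measured implementation: each 16-byte
SSE load fetches two words at once, the xor is done in an SSE register, and
the two halves are moved to integer registers for popcnt. The name
hamdist_mixed is hypothetical, and whether this actually beats 2.0c/w
depends on port pressure from the extra register moves. On targets without
SSE2 on x86-64 the sketch falls back to the scalar loop.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__) && defined(__x86_64__)
#include <emmintrin.h>
#endif

/* Hypothetical sketch: Hamming distance with SSE loads and integer popcnt. */
uint64_t hamdist_mixed(const uint64_t *a, const uint64_t *b, size_t n)
{
    uint64_t total = 0;
    size_t i = 0;
#if defined(__SSE2__) && defined(__x86_64__)
    /* Two words per iteration: one 16-byte load from each array. */
    for (; i + 2 <= n; i += 2) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        __m128i x  = _mm_xor_si128(va, vb);
        /* Move the low and high 64-bit halves to integer registers. */
        uint64_t lo = (uint64_t)_mm_cvtsi128_si64(x);
        uint64_t hi = (uint64_t)_mm_cvtsi128_si64(_mm_unpackhi_epi64(x, x));
        total += (uint64_t)__builtin_popcountll(lo);
        total += (uint64_t)__builtin_popcountll(hi);
    }
#endif
    /* Scalar tail (or the whole array when SSE2 is unavailable). */
    for (; i < n; i++)
        total += (uint64_t)__builtin_popcountll(a[i] ^ b[i]);
    return total;
}
```

With this layout the load count drops from two per word to one per word,
so the popcnt throughput of 1 per cycle becomes the next limit to watch.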