Nehalem/Westmere popham

Nehalem/Westmere has the new popcnt instruction, which has a throughput of 1 per cycle. We should therefore be able to get 1.0c/w for popcount and 2.0c/w for hamdist, bound by ld/st.

popcount which runs at 1.0c/w
hamdist which runs at 2.0c/w
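
As a rough illustration of where these two figures come from, here is a minimal C sketch of the scalar loops. The function names popcount_words and hamdist_words are made up for this note and are not the project's actual routines. It assumes GCC or Clang, using the __builtin_popcountll builtin in place of hand-written popcnt assembly. popcount needs one load and one popcnt per word, while hamdist needs two loads, one xor and one popcnt per word, which is what doubles the load traffic.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: count the set bits in n 64-bit words.
 * Per word: one load and one popcnt. */
uint64_t popcount_words(const uint64_t *a, size_t n)
{
    uint64_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += (uint64_t)__builtin_popcountll(a[i]);
    return total;
}

/* Hypothetical helper: Hamming distance between two arrays of n words.
 * Per word: two loads, one xor and one popcnt. */
uint64_t hamdist_words(const uint64_t *a, const uint64_t *b, size_t n)
{
    uint64_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += (uint64_t)__builtin_popcountll(a[i] ^ b[i]);
    return total;
}
```

With GCC or Clang, building with -mpopcnt (or -march=nehalem) lets the builtin compile to the hardware popcnt instruction; without such a flag the compiler may fall back to a software bit count.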

Using a mix of int and SSE instructions, it should be possible to break the ld/st bound in hamdist and improve on the 2.0c/w time.
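
One way the mixed approach might look is sketched below. This is an untested idea rather than a measured result, and hamdist_mixed is a hypothetical name. It assumes x86-64 with SSE2 and GCC or Clang. The loads and the xor are done 128 bits at a time in SSE registers, so two words cost two loads instead of four, and each 64-bit half is then moved to an integer register for popcnt. Whether this actually beats 2.0c/w depends on the extra cost of moving data between the SSE and integer domains, which is not measured here.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mixed int/SSE Hamming distance over n 64-bit words. */
uint64_t hamdist_mixed(const uint64_t *a, const uint64_t *b, size_t n)
{
    uint64_t total = 0;
    size_t i = 0;

    /* Main loop: two words per iteration, one 128-bit load per input. */
    for (; i + 2 <= n; i += 2) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        __m128i x  = _mm_xor_si128(va, vb);

        /* Move each 64-bit half to an integer register for popcnt. */
        uint64_t lo = (uint64_t)_mm_cvtsi128_si64(x);
        uint64_t hi = (uint64_t)_mm_cvtsi128_si64(_mm_unpackhi_epi64(x, x));

        total += (uint64_t)__builtin_popcountll(lo);
        total += (uint64_t)__builtin_popcountll(hi);
    }

    /* Tail: at most one leftover word, handled in scalar code. */
    if (i < n)
        total += (uint64_t)__builtin_popcountll(a[i] ^ b[i]);

    return total;
}
```

Unaligned loads are used so the sketch works on any input; if both arrays are known to be 16-byte aligned, _mm_load_si128 could be used instead.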