Nehalem logic

All the Intel chips are limited by a ld/st bandwidth of 1 write and 1 read op per cycle. Each result word needs two reads, so the best we can achieve is 2.0c/w for non-SSE, and 1.0c/w for SSE, where each op moves two words.

For non-SSE we just use the K8 versions, which all run at the best possible speed of 2.0c/w.

As unaligned loads are fast on Nehalem, we can go straight to the SSE versions of
and, or, xor, andn, orn, xorn, nand and nor. These mainly run at the optimal 1.0c/w, but for some alignments they run at 1.3c/w.
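
As a rough illustration of the SSE approach, here is a minimal C sketch using SSE2 intrinsics. It is not the actual assembly routine: the function name xor_n_sse_sketch and the 64-bit limb type are assumptions made for this example. Unaligned 128-bit loads and stores are used throughout, so no alignment fix-up is needed, and each iteration handles two words with two loads and one store.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: rp[i] = xp[i] ^ yp[i] for n 64-bit words.
   None of the pointers need to be 16-byte aligned. */
static void xor_n_sse_sketch(uint64_t *rp, const uint64_t *xp,
                             const uint64_t *yp, size_t n)
{
    size_t i = 0;

    /* Main loop: two words per iteration via unaligned 128-bit ops. */
    for (; i + 2 <= n; i += 2) {
        __m128i x = _mm_loadu_si128((const __m128i *)(xp + i));
        __m128i y = _mm_loadu_si128((const __m128i *)(yp + i));
        _mm_storeu_si128((__m128i *)(rp + i), _mm_xor_si128(x, y));
    }

    /* Tail: one leftover word when n is odd. */
    if (i < n)
        rp[i] = xp[i] ^ yp[i];
}
```

The other operations would follow the same shape with a different combining step, for example _mm_and_si128 or _mm_or_si128 in place of _mm_xor_si128.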