Nehalem logic
All the Intel chips are limited by a load/store bandwidth of one write
and one read op per cycle, so the best we can achieve is 2.0c/w for
non-SSE code and 1.0c/w for SSE code.
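The two figures above follow from simple arithmetic. The sketch below is only an illustration, assuming each result word needs one load from each of two source operands, that a general-purpose load moves one 64-bit word, and that a 128-bit SSE load moves two. The function name and parameters are hypothetical and do not come from the original code.

```c
#include <assert.h>

/* Lower bound on cycles per 64-bit word imposed by load bandwidth alone.
 *   loads_per_word  : source words read per result word (2 for rp = up OP vp)
 *   words_per_load  : 64-bit words moved by one load (assumed 1 for a
 *                     general-purpose load, 2 for a 128-bit SSE load)
 *   loads_per_cycle : loads the core can issue per cycle (1 in these notes)
 */
static double min_cycles_per_word(int loads_per_word,
                                  int words_per_load,
                                  int loads_per_cycle)
{
    assert(words_per_load > 0 && loads_per_cycle > 0);
    return (double)loads_per_word /
           ((double)words_per_load * (double)loads_per_cycle);
}
```

With these assumptions, min_cycles_per_word(2, 1, 1) gives the 2.0c/w non-SSE bound and min_cycles_per_word(2, 2, 1) gives the 1.0c/w SSE bound. The single store per cycle is not the binding limit in either case, since each result word needs only one write.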
For non-SSE we just use the K8 versions, which all run at the best
possible speed of 2.0c/w.
As Nehalem unaligned loads are fast, we can just go straight to the SSE
versions of and, or, xor, andn, orn, xorn, nand and nor, which mainly
run at the optimal 1.0c/w, but for some alignments run at 1.3c/w.
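As a rough illustration of going straight to SSE with unaligned loads, here is a minimal C sketch of the "and" case using SSE2 intrinsics. It is not the actual assembly: the name and_n_sketch and its signature are hypothetical, and a scalar fallback is included so the code still builds where SSE2 is unavailable.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper: rp[i] = up[i] & vp[i] for 0 <= i < n.
 * No alignment is required of rp, up or vp beyond that of uint64_t. */
static void and_n_sketch(uint64_t *rp, const uint64_t *up,
                         const uint64_t *vp, size_t n)
{
    size_t i = 0;

    assert(n == 0 || (rp != NULL && up != NULL && vp != NULL));

#if defined(__SSE2__)
    /* Two 64-bit words per iteration: two unaligned 128-bit loads
     * and one unaligned 128-bit store. */
    for (; i + 2 <= n; i += 2) {
        __m128i u = _mm_loadu_si128((const __m128i *)(up + i));
        __m128i v = _mm_loadu_si128((const __m128i *)(vp + i));
        _mm_storeu_si128((__m128i *)(rp + i), _mm_and_si128(u, v));
    }
#endif

    /* Odd trailing word, or the whole array when SSE2 is not available. */
    for (; i < n; i++)
        rp[i] = up[i] & vp[i];
}
```

The other operations would presumably differ only in the combining step, for example _mm_or_si128 or _mm_xor_si128 in place of _mm_and_si128, with an extra complement for the negated forms. Note that _mm_andnot_si128(a, b) computes (~a) & b, so operand order matters there.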