Core2 logic
All the Intel chips are limited by a load/store bandwidth of 1 write
and 1 read op per cycle. Each result word needs two reads and one
write, so the best we can achieve is 2.0c/w for non-SSE code and
1.0c/w for SSE, where each read covers two words.
For non-SSE code we just use the k8 versions, which all run at the
best possible speed of 2.0c/w.
The functions and, or, xor, andn, orn, xorn, nand and nor are all
taken from the k8.
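
For reference, the eight operations can be written as a portable C
sketch. This is not the k8 assembly; it only spells out what each
routine is expected to compute. The ref_* names and limb_t are
hypothetical, and the meanings of andn, orn and xorn are assumed to
match the GMP functions mpn_andn_n, mpn_iorn_n and mpn_xnor_n (second
operand complemented for andn/orn, result complemented for xorn).

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t limb_t;   /* hypothetical stand-in for one 64-bit word */

/* Generate one word-at-a-time loop per operation.
   Each iteration does two loads and one store. */
#define LOGIC_FN(name, expr)                                   \
    static void name(limb_t *rp, const limb_t *ap,             \
                     const limb_t *bp, size_t n)               \
    {                                                          \
        for (size_t i = 0; i < n; i++) {                       \
            limb_t a = ap[i], b = bp[i];                       \
            rp[i] = (expr);                                    \
        }                                                      \
    }

LOGIC_FN(ref_and,  a & b)
LOGIC_FN(ref_or,   a | b)
LOGIC_FN(ref_xor,  a ^ b)
LOGIC_FN(ref_andn, a & ~b)      /* assumed: second operand complemented */
LOGIC_FN(ref_orn,  a | ~b)      /* assumed: second operand complemented */
LOGIC_FN(ref_xorn, ~(a ^ b))
LOGIC_FN(ref_nand, ~(a & b))
LOGIC_FN(ref_nor,  ~(a | b))
```

A reference like this is mainly useful for checking an assembly
routine against known-good output.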
The SSE versions still need to be written.
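
As a rough idea of the shape an SSE version might take, here is a
minimal sketch of the "and" case using SSE2 intrinsics in C rather
than assembly. The name sse2_and_n and the limb_t type are
hypothetical. Unaligned loads and stores are used only to keep the
example short; a tuned version would likely need aligned accesses to
get near 1.0c/w on Core2.

```c
#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h>   /* SSE2 intrinsics */

typedef uint64_t limb_t;   /* hypothetical stand-in for one 64-bit word */

/* rp[i] = ap[i] & bp[i] for i in [0, n), two words per iteration. */
static void sse2_and_n(limb_t *rp, const limb_t *ap,
                       const limb_t *bp, size_t n)
{
    size_t i = 0;

    /* Two 128-bit loads and one 128-bit store handle two words. */
    for (; i + 2 <= n; i += 2) {
        __m128i a = _mm_loadu_si128((const __m128i *)(ap + i));
        __m128i b = _mm_loadu_si128((const __m128i *)(bp + i));
        _mm_storeu_si128((__m128i *)(rp + i), _mm_and_si128(a, b));
    }

    /* Odd trailing word, if any. */
    if (i < n)
        rp[i] = ap[i] & bp[i];
}
```

The other operations would follow the same pattern with
_mm_or_si128, _mm_xor_si128 or _mm_andnot_si128, plus an xor with
all ones where the result has to be complemented.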