AMD add/sub
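All the AMD chips are limited by a load/store bandwidth of 2 mem ops per
cycle. Each word of an add or sub needs two loads and one store, i.e. 3 mem
ops, so the best we can achieve is 1.5 c/w (cycles per word) for non-SSE code
and 1.0 c/w for SSE on the K10/K10-2. Using SSE is very tricky, however.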
add and sub both run at
1.5 c/w with a 4-way unroll
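
As a rough illustration of the shape of a 4-way unrolled add, here is a portable C sketch. It is not the tuned assembly these timings refer to, and a C compiler will not necessarily reach 1.5 c/w with it. The names add_limb and add_n_unroll4 and the 64-bit word type are assumptions made for the example. Each word costs two loads (ap[i], bp[i]) and one store (rp[i]), which is where the 3 mem ops per word, and hence the 1.5 c/w bound at 2 mem ops per cycle, comes from.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: out = a + b + cin, returns the carry out (0 or 1). */
static inline uint64_t add_limb(uint64_t a, uint64_t b, uint64_t cin,
                                uint64_t *out)
{
    uint64_t s  = a + b;
    uint64_t c1 = s < a;      /* carry from a + b */
    uint64_t r  = s + cin;
    uint64_t c2 = r < s;      /* carry from adding cin */
    *out = r;
    return c1 | c2;           /* the two carries can never both be set */
}

/* Hypothetical name: rp[0..n-1] = ap[0..n-1] + bp[0..n-1], returns carry out.
   The main loop handles 4 words per iteration; a tail loop handles the
   remaining n mod 4 words. */
uint64_t add_n_unroll4(uint64_t *rp, const uint64_t *ap, const uint64_t *bp,
                       size_t n)
{
    uint64_t cy = 0;
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        cy = add_limb(ap[i],     bp[i],     cy, &rp[i]);
        cy = add_limb(ap[i + 1], bp[i + 1], cy, &rp[i + 1]);
        cy = add_limb(ap[i + 2], bp[i + 2], cy, &rp[i + 2]);
        cy = add_limb(ap[i + 3], bp[i + 3], cy, &rp[i + 3]);
    }
    for (; i < n; i++)
        cy = add_limb(ap[i], bp[i], cy, &rp[i]);

    return cy;
}
```

In assembly the carry chain is a run of adc instructions, and the unroll mainly amortises the loop overhead across 4 words so that the memory ops become the limiting factor. sub has the same structure with a borrow in place of the carry.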