AMD add and sub

All the AMD chips are limited by a ld/st bandwidth of 2 mem ops per cycle. Each word of an add or sub needs two loads and one store, so the best we can achieve is 1.5 c/w (cycles per word) for non-SSE code, and 1.0 c/w for SSE on the K10/K10-2. Using SSE is very tricky, however.
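
The non-SSE bound above is simple arithmetic, and a small C helper makes it explicit. This is only an illustrative sketch: the function name min_cycles_per_word is made up for this note, and it assumes memory ops are the sole bottleneck. It covers the 1.5 c/w figure only; the 1.0 c/w SSE figure depends on how 128-bit accesses are counted, which is not spelled out here.

```c
#include <assert.h>

/* Lower bound in cycles per word when ld/st bandwidth is the only limit.
   mem_ops_per_word:  loads plus stores issued for each word processed
                      (3 for add/sub: two loads, one store).
   mem_ops_per_cycle: ld/st bandwidth (2 on the AMD chips discussed here).
   Hypothetical helper, not part of any library. */
static double min_cycles_per_word(double mem_ops_per_word,
                                  double mem_ops_per_cycle)
{
    assert(mem_ops_per_cycle > 0.0);
    return mem_ops_per_word / mem_ops_per_cycle;
}
```

With 3 mem ops per word and 2 mem ops per cycle this gives 3 / 2 = 1.5 c/w.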

add and sub both run at 1.5 c/w with a 4-way unroll.
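
For readers unfamiliar with the loop shape, here is a rough C sketch of what a 4-way unrolled add looks like. It is not the code being timed: the 1.5 c/w figure refers to hand-written assembly, and a C compiler will not necessarily turn this into a matching carry chain. The name add_n_unroll4 and the 64-bit word size are assumptions made for the illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* rp[0..n-1] = ap[0..n-1] + bp[0..n-1], returns the carry out (0 or 1).
   Each word costs two loads and one store, i.e. 3 mem ops.
   Hypothetical sketch; rp may alias ap or bp exactly. */
static uint64_t add_n_unroll4(uint64_t *rp, const uint64_t *ap,
                              const uint64_t *bp, size_t n)
{
    uint64_t carry = 0;
    size_t i = 0;

    assert(n == 0 || (rp != NULL && ap != NULL && bp != NULL));

/* One word of the carry chain: add carry in, add b, record carry out. */
#define ADD_STEP(k)                                  \
    do {                                             \
        uint64_t a = ap[i + (k)], b = bp[i + (k)];   \
        uint64_t s = a + carry;                      \
        uint64_t c1 = (s < carry);                   \
        uint64_t r = s + b;                          \
        carry = c1 | (uint64_t)(r < b);              \
        rp[i + (k)] = r;                             \
    } while (0)

    /* Main loop: four words per iteration, one loop branch per four words. */
    for (; i + 4 <= n; i += 4) {
        ADD_STEP(0);
        ADD_STEP(1);
        ADD_STEP(2);
        ADD_STEP(3);
    }
    /* Tail: the remaining 0 to 3 words. */
    for (; i < n; i++)
        ADD_STEP(0);

#undef ADD_STEP
    return carry;
}
```

Unrolling spreads the loop-counter and branch overhead over four words, leaving the loads and stores as the limiting resource. A sub would have the same structure with a borrow in place of the carry.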