Intel add sub

All the Intel chips are limited by the ld/st bandwidth of 1 load op per cycle. Each result word needs two loads (one from each source operand), so the best we can achieve is 2.0 cycles per word (c/w) for non-SSE code and 1.0c/w for SSE; however, using SSE is very tricky. We are also limited by the 2c latency of adc, but we can overcome this by splitting the work into two "independent" streams. NOTE: existing manuals state that the throughput of adc is 2c, which is wrong (for example, see addadd on core2).
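To make the load count concrete, here is a minimal portable C sketch of what an add routine has to do per word. The name ref_add_n and the 64-bit word type are assumptions for illustration only; this is not the tuned assembly, just a model showing the two loads, one store and the carry chain per word.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference routine (not the tuned assembly):
   rp[i] = ap[i] + bp[i] + carry-in, returns the final carry (0 or 1). */
static uint64_t ref_add_n(uint64_t *rp, const uint64_t *ap,
                          const uint64_t *bp, size_t n)
{
    assert(rp != NULL && ap != NULL && bp != NULL);
    uint64_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t a = ap[i];      /* load 1 */
        uint64_t b = bp[i];      /* load 2 */
        uint64_t s = a + b;
        uint64_t c1 = s < a;     /* carry out of a + b */
        uint64_t r = s + carry;
        uint64_t c2 = r < s;     /* carry out of adding the carry-in */
        rp[i] = r;               /* store */
        carry = c1 | c2;         /* at most one of c1, c2 can be set */
    }
    return carry;
}
```

With one load per cycle available, the two loads marked above are what give the 2.0c/w floor; the dependence of each step on the previous carry is the chain that the adc latency applies to.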

We can't use "inc" for the loop control, as this introduces a partial flags update stall (the best we can get is 2.5c/w). Saving and restoring the carry flag around the loop control with lahf/sahf, however, introduces another 2c of latency, which with 4-way unroll still gives us 2.5c/w. Instead we update the counter with lea (which doesn't affect flags) and use jrcxz.
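The shape of that loop can be modelled in C as below. This is only a sketch under assumptions: the names add_limb and add_n_unroll4 are hypothetical, n is assumed to be a positive multiple of 4, and in C the carry lives in a variable, so the flag-preservation problem itself does not show up. What it does show is the negative counter that climbs to zero in steps of 4 and is tested against zero directly, which is the job lea and jrcxz do in the assembly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One word step: *r = a + b + cin, returns the carry out (0 or 1). */
static uint64_t add_limb(uint64_t *r, uint64_t a, uint64_t b, uint64_t cin)
{
    uint64_t s = a + b;
    uint64_t c = s < a;          /* carry out of a + b */
    uint64_t t = s + cin;
    *r = t;
    return c | (uint64_t)(t < s);
}

/* Hypothetical 4-way unrolled add; assumes n is a positive multiple of 4. */
static uint64_t add_n_unroll4(uint64_t *rp, const uint64_t *ap,
                              const uint64_t *bp, size_t n)
{
    assert(n > 0 && n % 4 == 0);
    /* Point past the end and index with a negative counter. */
    rp += n;
    ap += n;
    bp += n;
    ptrdiff_t i = -(ptrdiff_t)n;
    uint64_t carry = 0;
    for (;;) {
        carry = add_limb(&rp[i + 0], ap[i + 0], bp[i + 0], carry);
        carry = add_limb(&rp[i + 1], ap[i + 1], bp[i + 1], carry);
        carry = add_limb(&rp[i + 2], ap[i + 2], bp[i + 2], carry);
        carry = add_limb(&rp[i + 3], ap[i + 3], bp[i + 3], carry);
        i += 4;                  /* counter update: the role lea plays */
        if (i == 0)              /* zero test on the counter: the role jrcxz plays */
            break;
    }
    return carry;
}
```

One way to read the lahf/sahf figure, assuming the extra 2c simply adds to each 4-word iteration: (4 x 2.0 + 2) / 4 = 2.5c/w.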

add and sub both run at 2.0c/w with 4-way unroll.