Intel add sub
All the Intel chips are limited by a load/store bandwidth of one load op
per cycle. An add or sub needs two loads per result word (one from each
source operand), so the best we can achieve is 2.0 c/w for non-SSE code
and 1.0 c/w for SSE; using SSE, however, is very tricky. We are also
limited by the 2c latency of adc, but we can overcome this by splitting
the work into two "independent" streams. NOTE: existing manuals state
that the throughput of adc is 2c, which is wrong (for example, see
addadd on core2).
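The load-bandwidth bound above is simple arithmetic, and can be written
out as a small sketch. The function name and parameters are hypothetical,
and the SSE figure assumes one 16-byte load fetches two 64-bit words; this
models the load limit only, not adc latency or any other stall.

```c
/* Lower bound on cycles per result word from load bandwidth alone.
 * Hypothetical helper for illustration; not taken from any library.
 *   source_operands : loads needed per result word (2 for add/sub)
 *   words_per_load  : 64-bit words fetched by one load op
 *                     (1 for general-purpose loads; 2 is the assumed
 *                      value for a 16-byte SSE load)
 *   loads_per_cycle : load ops the chip can issue per cycle (1 here)
 */
static double load_bound_cycles_per_word(double source_operands,
                                         double words_per_load,
                                         double loads_per_cycle)
{
    return source_operands / (words_per_load * loads_per_cycle);
}
```

With two source operands and one load per cycle, this gives 2.0 for
one-word loads and 1.0 for two-word loads, matching the figures quoted
above.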
We can't use "inc" for the loop control, as this introduces a partial
flags update stall (the best we can get is 2.5 c/w). Saving and
restoring the carry flag around the loop control with lahf/sahf
introduces another 2c of latency, which with a 4-way unroll still gives
us 2.5 c/w. Instead we update the counter with lea (which doesn't affect
flags) and branch with jrcxz.
add and sub both run at 2.0 c/w with a 4-way unroll.
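As a rough illustration of the loop shape described in these notes, here
is a portable C model of a 4-way unrolled add. It is a sketch only, not
the actual assembly: the names limb_t, adc_step and add_n_unroll4 are
hypothetical, the carry is an explicit variable rather than the CPU carry
flag, and n is assumed to be a multiple of 4. The index runs from -n up
to zero so that the exit test depends only on the counter, which is the
role lea and jrcxz play in the assembly.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t limb_t;   /* hypothetical name for one 64-bit word */

/* One add-with-carry step: *r = a + b + *carry; *carry becomes 0 or 1. */
static inline void adc_step(limb_t *r, limb_t a, limb_t b, unsigned *carry)
{
    limb_t s = a + b;
    unsigned c1 = s < a;          /* carry out of a + b */
    limb_t t = s + *carry;
    unsigned c2 = t < s;          /* carry out of adding the carry in */
    *r = t;
    *carry = c1 | c2;             /* at most one of c1, c2 can be set */
}

/* r = a + b over n words; returns the carry out.
 * Assumption: n is a multiple of 4 (no tail handling in this sketch). */
static unsigned add_n_unroll4(limb_t *r, const limb_t *a, const limb_t *b,
                              size_t n)
{
    unsigned carry = 0;
    ptrdiff_t i;

    if (n == 0)
        return 0;

    /* Point just past the end and index with a negative counter. */
    r += n;
    a += n;
    b += n;
    i = -(ptrdiff_t)n;

    do {
        adc_step(&r[i],     a[i],     b[i],     &carry);
        adc_step(&r[i + 1], a[i + 1], b[i + 1], &carry);
        adc_step(&r[i + 2], a[i + 2], b[i + 2], &carry);
        adc_step(&r[i + 3], a[i + 3], b[i + 3], &carry);
        i += 4;               /* asm: lea, which leaves the flags alone */
    } while (i != 0);         /* asm: jrcxz tests the counter, not flags */

    return carry;
}
```

In the assembly the carry lives in the flags register across iterations,
which is why the counter update and the exit test must not touch the
flags. The C model sidesteps that by keeping the carry in a variable.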