AMD adddiag
We can save some time by combining the diag function with an add.
A 1-way unroll is limited to 4c/w (cycles per word) by the 4c latency
of the chain restore carry, adc, adc, save carry, i.e.
mov (src),AX
mul AX
restore carry
adc AX,(dst)
adc DX,8(dst)
save carry
However, due to the scheduler this only reaches 4.4c/w. Using a
separate ld/st instead:
mov (src),AX
mul AX
restore carry
adc (dst),AX
adc 8(dst),DX
save carry
mov AX,(dst)
mov DX,8(dst)
This does run at 4c/w.
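As a point of reference, the arithmetic performed by the 1-way loop can
be modelled in C. This is only a sketch of the semantics, not the
library routine: the names adddiag_ref and limb_t are hypothetical, and
limbs are assumed to be 32 bits wide so that the double-word product
fits in a uint64_t (the assembly above uses 64-bit limbs).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference model, not the library routine.
   32-bit limbs are an assumption made for portability. */
typedef uint32_t limb_t;

/* Adds src[i]^2 into dst[2i], dst[2i+1] for i in [0, n), carrying one
   bit through the whole loop. Returns the carry out of dst[2n-1]. */
static limb_t adddiag_ref(limb_t *dst, const limb_t *src, size_t n)
{
    limb_t carry = 0;                               /* the saved carry */
    for (size_t i = 0; i < n; i++) {
        uint64_t sq = (uint64_t)src[i] * src[i];    /* mul AX */
        limb_t lo = (limb_t)sq;                     /* AX */
        limb_t hi = (limb_t)(sq >> 32);             /* DX */

        /* restore carry, then adc the low word */
        uint64_t t = (uint64_t)dst[2 * i] + lo + carry;
        dst[2 * i] = (limb_t)t;

        /* adc the high word */
        t = (uint64_t)dst[2 * i + 1] + hi + (t >> 32);
        dst[2 * i + 1] = (limb_t)t;

        carry = (limb_t)(t >> 32);                  /* save carry */
    }
    return carry;
}
```

The carry variable plays the role of the saved and restored carry flag,
which is the dependency that limits the 1-way loop.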
For a 2-way unroll we can take advantage of the fact that the high
product can absorb a carry with no further carries generated (the high
word of a 64x64-bit product is at most 2^64-2) to knock 1 cycle off
the adc chain latency:
mov (src),AX
mov $0,r1
mul AX
add r2,-8(dst)
adc AX,(dst)
adc DX,r1
mov 8(src),AX
mov $0,r2
mul AX
add r1,8(dst)
adc AX,16(dst)
adc DX,r2
Here we are limited to 3c/w both by the adc chain latency and by the
macro-op/pick hardware, and this does run at 3c/w.
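The carry absorption used above can be checked with a small C model.
It is a sketch under stated assumptions rather than the actual routine:
adddiag_absorb, limb_t and pending are hypothetical names, limbs are
taken to be 32 bits for portability, and the loop is not literally
unrolled (in the assembly, alternating r1 and r2 plays the role of
pending). For a limb base B the largest square is
(B-1)^2 = B^2 - 2B + 1, so the high word is at most B-2 and adding a
carry of 1 cannot wrap.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model; 32-bit limbs are an assumption for portability. */
typedef uint32_t limb_t;

/* Same result as a plain carry-chained diagonal add, but the carry out
   of the low word is folded into the high product and added to dst one
   step later. Returns the carry out of dst[2n-1]. */
static limb_t adddiag_absorb(limb_t *dst, const limb_t *src, size_t n)
{
    if (n == 0)
        return 0;

    limb_t pending = 0;     /* r1/r2: high product plus absorbed carry */
    for (size_t i = 0; i < n; i++) {
        uint64_t sq = (uint64_t)src[i] * src[i];    /* mul AX */
        limb_t lo = (limb_t)sq;
        limb_t hi = (limb_t)(sq >> 32);
        uint64_t t;
        limb_t c = 0;

        if (i > 0) {
            /* add r,-8(dst): a plain add, so no carry is restored */
            t = (uint64_t)dst[2 * i - 1] + pending;
            dst[2 * i - 1] = (limb_t)t;
            c = (limb_t)(t >> 32);
        }

        /* adc AX,(dst) */
        t = (uint64_t)dst[2 * i] + lo + c;
        dst[2 * i] = (limb_t)t;
        c = (limb_t)(t >> 32);

        /* adc DX,r with r = 0: hi <= 2^32 - 2, so no wrap-around */
        pending = hi + c;
        assert(pending >= hi);
    }

    /* flush the last pending word into the top limb */
    uint64_t t = (uint64_t)dst[2 * n - 1] + pending;
    dst[2 * n - 1] = (limb_t)t;
    return (limb_t)(t >> 32);
}
```

Because each step begins with a plain add, the chain per source word is
add, adc, adc, with no separate save and restore of the carry flag.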
The 3-way unroll also runs at 3c/w, bound by the pick hardware.
The 4-way unroll is bound by ld/st to 2.5c/w, by retirement to 2.5c/w,
by add latency to 2.5c/w and by the pick hardware to 2.75c/w, which we
can achieve: adddiag_4way runs at 2.75c/w.
With more unrolling we may be able to achieve the optimum 2.5c/w bound
set by ld/st, but as this function is only used in sqr_basecase it is
not worth it.