AMD adddiag

We can save some time by combining the diag function with an add.
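
As a reference for what the assembly variants below compute, here is a minimal C sketch. It is based on my reading of the pseudo-code in these notes, not on the actual source: the name adddiag_ref, the argument order and the returned carry are assumptions, and unsigned __int128 is a GCC/Clang extension used to model the 64x64->128 bit mul.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference model (name and signature are assumptions):
   dst[0..2n-1] += sum over i of src[i]^2 * B^(2i), with B = 2^64.
   Returns the carry out of the top limb. */
uint64_t adddiag_ref(uint64_t *dst, const uint64_t *src, size_t n)
{
    uint64_t carry = 0;                       /* the saved/restored carry */
    for (size_t i = 0; i < n; i++) {
        /* mul AX: 128-bit square of one source word */
        unsigned __int128 sq = (unsigned __int128)src[i] * src[i];
        uint64_t lo = (uint64_t)sq;
        uint64_t hi = (uint64_t)(sq >> 64);

        /* adc AX,(dst): low product plus incoming carry */
        unsigned __int128 t = (unsigned __int128)dst[2 * i] + lo + carry;
        dst[2 * i] = (uint64_t)t;

        /* adc DX,8(dst): high product plus carry from the low add */
        t = (unsigned __int128)dst[2 * i + 1] + hi + (uint64_t)(t >> 64);
        dst[2 * i + 1] = (uint64_t)t;
        carry = (uint64_t)(t >> 64);
    }
    return carry;
}
```

Each source word touches two destination words, and the carry links one word's pair to the next, which is the dependency chain the unrolled loops below try to shorten.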

A 1-way unroll is limited to 4c/w (cycles per word) by the 4c latency of the chain restore carry, adc, adc, save carry, i.e.

mov (src),AX
mul AX
restore carry
adc AX,(dst)
adc DX,8(dst)
save carry

However, due to the scheduler we can only get 4.4c/w, but by using a separate ld/st

mov (src),AX
mul AX
restore carry
adc (dst),AX
adc 8(dst),DX
save carry
mov AX,(dst)
mov DX,8(dst)

this does run at 4c/w.

For a 2-way unroll we can take advantage of the fact that the high product can absorb a carry with no further carries generated (the high word of a 64-bit square is at most 2^64-2), which knocks 1 cycle off the adc chain latency:

mov (src),AX
mov $0,r1
mul AX
add r2,-8(dst)
adc AX,(dst)
adc DX,r1

mov 8(src),AX
mov $0,r2
mul AX
add r1,8(dst)
adc AX,16(dst)
adc DX,r2

Here we are limited by the adc chain latency to 3c/w and by the macro-op/pick hardware to 3c/w, and this does run at 3c/w.
The 3-way unroll also runs at 3c/w, bound by the pick hardware.
The 4-way unroll is bound by ld/st to 2.5c/w, by retirement to 2.5c/w, by add latency to 2.5c/w and by the pick hardware to 2.75c/w, which we can achieve.
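
The carry-absorbing trick from the 2-way unroll can be modelled in C. This is only a sketch under assumptions: the function names are hypothetical, the first word skips the add of the pending register (a real loop would handle that in its prologue), and unsigned __int128 is a GCC/Clang extension. The variable r plays the role of r1/r2: it holds the high product plus the absorbed carry until the next word's chain adds it into dst.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* High 64 bits of x*x; at most 2^64-2 for any 64-bit x. */
static uint64_t sqr_hi(uint64_t x)
{
    return (uint64_t)(((unsigned __int128)x * x) >> 64);
}

/* Hypothetical model of the deferred-high-word scheme.
   Computes the same result as a plain carry chain over dst[0..2n-1]. */
uint64_t adddiag_deferred_model(uint64_t *dst, const uint64_t *src, size_t n)
{
    uint64_t r = 0;                          /* pending high word (r1/r2) */
    for (size_t i = 0; i < n; i++) {
        uint64_t lo = src[i] * src[i];       /* low 64 bits, wraps mod 2^64 */
        uint64_t hi = sqr_hi(src[i]);
        uint64_t carry = 0;

        if (i > 0) {                         /* add r,-8(dst): starts a new chain */
            uint64_t s = dst[2 * i - 1] + r;
            carry = s < r;
            dst[2 * i - 1] = s;
        }

        /* adc AX,(dst) */
        unsigned __int128 t = (unsigned __int128)dst[2 * i] + lo + carry;
        dst[2 * i] = (uint64_t)t;
        carry = (uint64_t)(t >> 64);

        /* adc DX,r with r = 0: hi <= 2^64-2, so this cannot wrap */
        r = hi + carry;
        assert(r >= hi);
    }
    if (n == 0)
        return 0;

    /* epilogue: flush the last pending high word */
    uint64_t s = dst[2 * n - 1] + r;
    dst[2 * n - 1] = s;
    return s < r;                            /* carry out of the top limb */
}
```

Because the register absorbs the carry without overflowing, each word's chain is add, adc, adc and no carry has to be saved and restored between words.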

adddiag_4way runs at 2.75c/w

With more unrolling we may be able to achieve the optimum of 2.5c/w set by the ld/st bound, but as this function is only used in sqr_basecase it is not worth it.