AMD incmul_1/decmul_1 and mul_1
inc/decmul_1 is bounded below at 2.333+ε c/w by macro-op retirement of the
pipelined sequence

    mov $0,r2
    mov (src),AX
    mul CX
    add r0,-8(dst)   // for the decmul_1 version replace this add with a sub
    adc AX,r1
    adc DX,r2
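In C terms the loop is a word-by-word multiply-accumulate with carry propagation. A minimal sketch of the semantics (assuming 64-bit limbs and the GCC/Clang unsigned __int128 extension; the function name is just illustrative):

```c
#include <stdint.h>
#include <stddef.h>

/* Sketch of what the incmul_1 loop computes: dst += src * m,
   one 64-bit word per iteration, returning the final carry word.
   (decmul_1 would subtract the product instead of adding it.) */
uint64_t incmul_1(uint64_t *dst, const uint64_t *src, size_t n, uint64_t m)
{
    uint64_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        unsigned __int128 t = (unsigned __int128)src[i] * m  /* mul  */
                            + dst[i]                         /* add  */
                            + carry;                         /* adcs */
        dst[i] = (uint64_t)t;
        carry  = (uint64_t)(t >> 64);
    }
    return carry;
}
```

The chain through carry is what forces the adc pair in the assembly sequence.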
mul_1 is very similar

    mov $0,r2
    mov (src),AX
    mul CX
    mov r0,-8(dst)
    add AX,r1
    adc DX,r2
and has the same bound (and runs at the same speed). The incmul_1
sequence is very limiting for the schedulers and pick hardware; the
mul_1 sequence has more flexibility.
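The mul_1 loop above differs only in storing the product rather than accumulating into the destination; a matching C sketch (same assumptions as before, illustrative name):

```c
#include <stdint.h>
#include <stddef.h>

/* Sketch of what the mul_1 loop computes: dst = src * m, one
   64-bit word per iteration, returning the final carry word.
   Note the plain mov store: dst is written, not read, which is
   what gives the schedulers more freedom than in incmul_1. */
uint64_t mul_1(uint64_t *dst, const uint64_t *src, size_t n, uint64_t m)
{
    uint64_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        unsigned __int128 t = (unsigned __int128)src[i] * m + carry;
        dst[i] = (uint64_t)t;
        carry  = (uint64_t)(t >> 64);
    }
    return carry;
}
```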
A 3-way unroll (which is the minimum for incmul_1) runs at 2.666 c/w and
a 4-way unroll runs at 2.5 c/w.
A version which has 4 cases in the epilog to handle the leftovers,
incmul_1_4way, runs at 2.5 c/w; code size 407 bytes.
A version which jumps into the middle of the loop to handle the
leftovers, incmul_1_4way_jmpin, runs at 2.5 c/w; code size 278 bytes.
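The jump-into-the-middle trick can be sketched in C as a Duff's-device-style dispatch: compute size % 4 and enter the 4-way unrolled body at the matching point, so the leftovers are absorbed by the first (partial) pass instead of a separate epilog. Illustrated here on a simple word sum rather than the multiply loop; the function name is hypothetical:

```c
#include <stdint.h>
#include <stddef.h>

/* 4-way unrolled sum that "jumps into the middle of the loop"
   to absorb the n % 4 leftover words on the first pass. */
uint64_t sum_4way_jmpin(const uint64_t *p, size_t n)
{
    uint64_t s = 0;
    if (n == 0)
        return 0;
    size_t blocks = (n + 3) / 4;  /* total passes; first may be partial */
    switch (n % 4) {
    case 0: do { s += *p++;       /* full pass entry point */
    case 3:      s += *p++;       /* enter here for 3 leftovers */
    case 2:      s += *p++;       /* ... 2 leftovers */
    case 1:      s += *p++;       /* ... 1 leftover  */
            } while (--blocks);
    }
    return s;
}
```

The assembly version computes the entry offset and jumps there directly, which is why it trades the epilog's code size for a size % 4 calculation up front.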
mul_1 has a minimum unroll of 2, so it is possible that a 2x2-way
unroll could be found which uses two fewer registers.
The next step is a 7-way unroll, which runs at 2.428 c/w. The mul_1
version is easy to find, but for incmul_1 the schedulers have a hard
time as all the operations are in the ALUs. We can reduce this pressure
on the ALUs by replacing the instruction mov $0,reg with
lea (reg0,reg0,2),reg, where reg0=0; this uses the AGU instead of the
ALU (even on K10, hence the addressing mode).
A version which has 7 cases in the epilog to handle the leftovers,
incmul_1_7way, runs at 2.428 c/w.
A version which has 7 cases in the epilog and uses a small jump table
to choose them, incmul_1_7way_jmpepi, runs at 2.428 c/w.
A version which jumps into the middle of the loop to handle the
leftovers, incmul_1_7way_jmpin, runs at 2.428 c/w.
These aren't really practical, so no effort has been made to optimize
them. Code size is the problem for the first two, and calculating
size%7 for the last one.
Further unrolling may be possible: 5x2-, 4x4- and 2x8-way unrolls would
give 2.4, 2.375 and 2.375 c/w.
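All of the figures above are consistent with a simple retirement model: a k-way unrolled iteration costs ceil((7k + 2)/3) cycles, assuming 7 macro-ops per word (which matches the stated 7/3 ≈ 2.333 asymptotic bound) plus 2 loop-control macro-ops per iteration, retired 3 per cycle. The overhead count of 2 is an assumption fitted to the quoted numbers, not stated above:

```c
/* Retirement-bound model for a k-way unrolled incmul_1 loop:
   cycles per iteration = ceil((7k + 2) / 3), assuming 7 macro-ops
   per word plus 2 loop-control macro-ops (assumed), 3 retired
   per cycle. Returns the resulting cycles-per-word bound. */
double cw_bound(int k)
{
    int cycles = (7 * k + 2 + 2) / 3;   /* ceil((7k + 2) / 3) */
    return (double)cycles / (double)k;
}
```

Under this model, k = 3, 4, 7, 10 (5x2) and 16 (4x4 or 2x8) reproduce the 2.666, 2.5, 2.428, 2.4 and 2.375 c/w figures quoted above.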