AMD inc/decmul_1 mul_1

inc/decmul_1 is bound to 2.333+ c/w (7/3 c/w) by macro-op retirement of the pipelined sequence

mov $0,r2
mov (src),AX
mul CX
add r0,-8(dst)    // for the decmul_1 version replace this add with a sub
adc AX,r1
adc DX,r2

mul_1 is very similar

mov $0,r2
mov (src),AX
mul CX
mov r0,-8(dst)
add AX,r1
adc DX,r2

and has the same bound (and runs at the same speed). The incmul_1 sequence is very limiting for the schedulers and the pick hardware; the mul_1 sequence gives them more flexibility.
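
For reference, the arithmetic the two sequences implement can be sketched in portable C (a hypothetical model using unsigned __int128; the function names here are made up, not the real entry points):

```c
#include <stdint.h>
#include <stddef.h>

typedef unsigned __int128 u128;

/* mul_1: dst[i] = src[i]*v plus the carry from the previous word;
   returns the final carry word. */
uint64_t mul_1(uint64_t *dst, const uint64_t *src, size_t n, uint64_t v)
{
    uint64_t cy = 0;
    for (size_t i = 0; i < n; i++) {
        u128 p = (u128)src[i] * v + cy;  /* mul CX, plus the add/adc carry pair */
        dst[i] = (uint64_t)p;            /* mov r0,-8(dst) */
        cy = (uint64_t)(p >> 64);
    }
    return cy;
}

/* incmul_1: dst[i] += src[i]*v; decmul_1 is the same with += replaced by -=,
   matching the add -> sub swap in the asm. */
uint64_t incmul_1(uint64_t *dst, const uint64_t *src, size_t n, uint64_t v)
{
    uint64_t cy = 0;
    for (size_t i = 0; i < n; i++) {
        u128 p = (u128)src[i] * v + cy;
        u128 s = (u128)dst[i] + (uint64_t)p;  /* add r0,-8(dst) */
        dst[i] = (uint64_t)s;
        cy = (uint64_t)(p >> 64) + (uint64_t)(s >> 64);
    }
    return cy;
}
```

This is only a semantic model; the interesting part of the asm is how the same dependency chain maps onto the three symmetric ALU pipes.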

A 3-way unroll (the minimum for incmul_1) runs at 2.666 c/w and a 4-way unroll runs at 2.5 c/w

A version which has 4 cases in the epilog to handle the leftovers
incmul_1_4way runs at 2.5 c/w, code size 407 bytes

A version which jumps into the middle of the loop to handle the leftovers
incmul_1_4way_jmpin runs at 2.5 c/w, code size 278 bytes
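
The jump-into-the-loop idea is the classic Duff's-device layout. A hypothetical C sketch for a 4-way mul_1 (a model only, not the actual asm, which adjusts pointers and jumps to a computed offset inside the unrolled loop; requires n >= 1):

```c
#include <stdint.h>
#include <stddef.h>

typedef unsigned __int128 u128;

/* Instead of a separate epilog, compute n % 4 and enter the unrolled
   loop body at the matching position (Duff's device). */
uint64_t mul_1_4way_jmpin(uint64_t *dst, const uint64_t *src, size_t n, uint64_t v)
{
    uint64_t cy = 0;
    size_t i = 0, iters = (n + 3) / 4;
    u128 p;
    switch (n % 4) {
    case 0: do { p = (u128)src[i] * v + cy; dst[i++] = (uint64_t)p; cy = (uint64_t)(p >> 64);
    case 3:      p = (u128)src[i] * v + cy; dst[i++] = (uint64_t)p; cy = (uint64_t)(p >> 64);
    case 2:      p = (u128)src[i] * v + cy; dst[i++] = (uint64_t)p; cy = (uint64_t)(p >> 64);
    case 1:      p = (u128)src[i] * v + cy; dst[i++] = (uint64_t)p; cy = (uint64_t)(p >> 64);
            } while (--iters != 0);
    }
    return cy;
}
```

No duplicated epilog cases are needed, which is where the 407 -> 278 byte reduction comes from.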

mul_1 has a minimum unroll of 2, so it may be possible to find a 2x2-way unroll which uses 2 fewer registers.

The next step is a 7-way unroll, which runs at 2.428 c/w. The mul_1 version is easy to find, but for the incmul_1 the schedulers have a hard time, as all the operations are in the ALUs. We can reduce this pressure on the ALUs by replacing the instruction mov $0,reg with lea (reg0,reg0,2),reg, where reg0=0; this uses the AGU instead of the ALU (even on K10, hence the addressing mode).
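
The substitution is easy to sanity-check (an illustration only: inline asm on x86-64, with a portable C fallback modelling the same computation elsewhere):

```c
#include <stdint.h>

/* With reg0 pinned to zero, lea (reg0,reg0,2),reg computes
   reg0 + 2*reg0 = 3*reg0 = 0, i.e. it zeros reg exactly like
   mov $0,reg, but issues to an AGU instead of an ALU. */
uint64_t zero_via_lea(void)
{
    uint64_t reg0 = 0;
    uint64_t reg;
#if defined(__x86_64__)
    __asm__("lea (%1,%1,2), %0" : "=r"(reg) : "r"(reg0));
#else
    reg = reg0 + 2 * reg0;  /* portable model of the same lea */
#endif
    return reg;
}
```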

A version which has 7 cases in the epilog to handle the leftovers
incmul_1_7way runs at 2.428 c/w

A version which has 7 cases in the epilog to handle the leftovers and uses a small jump table to choose them
incmul_1_7way_jmpepi runs at 2.428 c/w

A version which jumps into the middle of the loop to handle the leftovers
incmul_1_7way_jmpin runs at 2.428 c/w

These aren't really practical, so no effort has been made to optimize them. Code size is the problem for the first two, and calculating size%7 (to pick the leftover case) for the last one.

Further unrolling may be possible: 5x2-, 4x4-, and 2x8-way unrolls would give 2.4, 2.375, and 2.375 c/w respectively.
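
The quoted figures fit a simple retirement-bound model (my reconstruction, not stated above): assuming the six-instruction sequence is 7 macro-ops per word (the mul decoding to two), an n-way unrolled loop adds about 2 macro-ops of loop control per iteration, and retirement is 3 macro-ops/cycle, then c/w = (7n+2)/(3n):

```c
/* Hypothetical retirement-bound model: 7 macro-ops per word plus ~2
   macro-ops of loop control per n-way iteration, retiring 3 per cycle. */
double cycles_per_word(int n)
{
    return (7.0 * n + 2.0) / (3.0 * n);
}
```

This reproduces 2.5, 2.428, 2.4 and 2.375 c/w for n = 4, 7, 10 and 16, and tends to the 7/3 = 2.333 bound as n grows; the reported 3-way figure of 2.666 suggests that variant carries one extra overhead op per iteration.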