AMD shift
The double-shift instructions SHLD/SHRD are slow on all the AMD chips,
so we have to synthesize one from 6 instructions, i.e.
(load, copy, shl, shr, or, store). For the regular x86 instruction set
both shifts require the count in the cx register, with two different
counts; by using MMX we can avoid the two instructions needed to swap
between these two count values.
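Without SHLD, each output limb is built from two neighbouring source limbs. A plain C reference of the six-operation sequence per limb (my sketch; the function name and interface are hypothetical, loosely modelled on mpn-style shift routines):

```c
#include <stddef.h>
#include <stdint.h>

/* Shift an n-limb little-endian number left by cnt bits (0 < cnt < 64).
   Each limb needs the six operations listed above:
   load, copy, shl, shr, or, store. */
uint64_t lshift_ref(uint64_t *dst, const uint64_t *src, size_t n, unsigned cnt)
{
    uint64_t retval = src[n - 1] >> (64 - cnt);   /* bits shifted out the top */
    for (size_t i = n - 1; i > 0; i--)
        dst[i] = (src[i] << cnt) | (src[i - 1] >> (64 - cnt));
    dst[0] = src[0] << cnt;
    return retval;
}
```

The two shift counts, cnt and 64-cnt, are exactly the pair that forces the cx swapping mentioned above when using plain x86 shifts.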
Using SSE (on K10) can further speed things up, but the writes are still
64-bit (which means we don't need to worry about alignment for writes),
so they take two ops; the shifting is split into two 64-bit chunks and
needs 1 extra shuffle instruction to put the result back together, so we
need 8 instructions. If we also require that memory is accessed in
incrementing or decrementing order, then the decrementing lshift needs 1
extra instruction, as there is no single SSE instruction for the shuffle
needed.
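As a rough illustration (my reconstruction in SSE2 intrinsics, not the actual assembly), one unrolled step shifts two 64-bit chunks at once, recombines them with one shuffle, and stores the result as two 64-bit writes:

```c
#include <emmintrin.h>
#include <stdint.h>

/* Sketch: lshift two limbs in one step with SSE2, 0 < cnt < 64.
   Function name is hypothetical. */
void lshift2_sse(uint64_t *dst, const uint64_t *src, unsigned cnt)
{
    __m128i x    = _mm_loadu_si128((const __m128i *)src); /* {src[0], src[1]} */
    __m128i hi   = _mm_sll_epi64(x, _mm_cvtsi32_si128(cnt));
    __m128i prev = _mm_slli_si128(x, 8);      /* {0, src[0]}: the shuffle step */
    __m128i lo   = _mm_srl_epi64(prev, _mm_cvtsi32_si128(64 - cnt));
    __m128i r    = _mm_or_si128(hi, lo);
    /* the stores still go out as two 64-bit ops */
    _mm_storel_epi64((__m128i *)dst, r);
    _mm_storel_epi64((__m128i *)(dst + 1), _mm_unpackhi_epi64(r, r));
}
```

For the decrementing lshift the neighbouring limb sits on the other side, which is where the missing single-instruction shuffle (and hence the extra instruction) comes in.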
At this amount of unrolling all these functions are optimal (bounded by
macro-op retirement).
K8 takes 2.0+ε c/w
K10 takes 1.333+ε c/w
K10 lshift decr takes 1.5+ε c/w
k8_lshift_decr_4way runs at 2.166 c/w
k8_rshift_incr_4way runs at 2.166 c/w
k10_lshift_decr_4way runs at 1.666 c/w
k10_lshift_incr_4way runs at 1.5 c/w (TODO)
k10_rshift_incr_4way runs at 1.5 c/w
The wind-down code could be cleaned up somewhat
Shifts by a fixed amount can be done a bit faster in some cases, and
also when destination = source, although the lshifts are then
incrementing and the rshifts decrementing.
For the K8 and K10
lshift_by_1_inplace runs at 1.0 c/w
rshift_by_1_inplace runs at 1.0 c/w
lshift_by_1 runs at 1.3 c/w
rshift_by_1 runs at 1.3 c/w
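One likely reason the by-1 case reaches 1.0 c/w (my assumption, not stated above): x << 1 == x + x, so the in-place lshift can ride the add/adc carry chain instead of combining two shifted limbs. A C sketch of that carry chain (function name hypothetical):

```c
#include <stddef.h>
#include <stdint.h>

/* In-place lshift by 1 over n limbs, expressed as the carry chain that
   add/adc would produce.  Returns the bit shifted out the top. */
uint64_t lshift1_inplace(uint64_t *p, size_t n)
{
    uint64_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t x = p[i];
        p[i] = (x << 1) | carry;   /* x + x + carry_in, as adc would do */
        carry = x >> 63;           /* carry out: the bit falling off the top */
    }
    return carry;
}
```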
And for the K8 only
lshift_by_2, lshift_by_3, lshift_by_4, lshift_by_5, lshift_by_6 run at
rshift_by_2 runs at
Shifts by multiples of a byte can also be done faster.
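That is because a byte-multiple shift needs no bit-level recombination at all: it reduces to a byte-granular copy plus zero fill. A minimal sketch (function name hypothetical, assuming a little-endian limb buffer viewed as bytes):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Left shift of an nbytes-long little-endian number by a whole number of
   bytes: bytes move upward in memory, the vacated low bytes become zero,
   and the top shift_bytes bytes fall off the end. */
void lshift_bytes(uint8_t *dst, const uint8_t *src,
                  size_t nbytes, size_t shift_bytes)
{
    memmove(dst + shift_bytes, src, nbytes - shift_bytes);
    memset(dst, 0, shift_bytes);
}
```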