AMD shift

The double-shift instructions SHLD/SHRD are slow on all the AMD chips, so we synthesize one from 6 instructions, i.e. (load, copy, shl, shr, or, store). With the plain x86 instruction set both shifts require their count in the cl register, and the two counts differ; by using MMX we can avoid the two instructions needed to swap between these two count values. Using SSE (on K10) can speed things up further, but the writes are still 64-bit (which means we need not worry about alignment for writes), so they take two ops, and the shifting is split into two 64-bit chunks and needs 1 extra shuffle instruction to put the result back together, giving 8 instructions. If we also require that memory is accessed in incrementing or decrementing order, then the decrementing lshift needs 1 extra instruction, as there is no single SSE instruction for the shuffle required.
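The per-word (load, copy, shl, shr, or, store) sequence above can be modelled in C. This is a hypothetical sketch, not the actual assembly routine: the function name lshift_words is made up, and a real implementation would use the MMX/SSE scheduling described above.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: shift an n-word number left by cnt bits
   (0 < cnt < 64), src[0] being the least significant word.  Each
   destination word is built from two source words, mirroring the
   6-instruction load/copy/shl/shr/or/store sequence.  Returns the
   bits shifted out of the top word. */
uint64_t lshift_words(uint64_t *dst, const uint64_t *src, size_t n, unsigned cnt)
{
    unsigned tnc = 64 - cnt;                 /* the second, differing count */
    uint64_t retval = src[n - 1] >> tnc;     /* bits shifted out the top */
    for (size_t i = n - 1; i > 0; i--)
        dst[i] = (src[i] << cnt) | (src[i - 1] >> tnc);  /* shl, shr, or */
    dst[0] = src[0] << cnt;
    return retval;
}
```

Note the two shift counts cnt and 64-cnt: on plain x86 both must pass through cl, which is exactly the swap overhead that MMX avoids.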

For this amount of unrolling all these functions are optimal (bounded by macro-op retirement):
K8 takes 2.0+ε c/w
K10 takes 1.333+ε c/w
K10 lshift decr takes 1.5+ε c/w

k8_lshift_decr_4way runs at 2.166 c/w
k8_rshift_incr_4way runs at 2.166 c/w
k10_lshift_decr_4way runs at 1.666 c/w
k10_lshift_incr_4way runs at 1.5 c/w           (TODO: implement this?)
k10_rshift_incr_4way runs at 1.5 c/w

The wind-down code could be cleaned up somewhat.
 
Shifts by a fixed amount can be done a bit faster in some cases, and also when destination = source, although these lshifts access memory incrementing and the rshifts decrementing.

For the K8 and K10
lshift_by_1_inplace runs at 1.0 c/w
rshift_by_1_inplace runs at 1.0 c/w
lshift_by_1 runs at 1.3 c/w
rshift_by_1 runs at 1.3 c/w
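A fixed-count in-place shift like the lshift_by_1_inplace listed above can be sketched in C. This is a model only, assuming the routine's interface (pointer, word count, returned carry); the real 1.0 c/w routine is hand-written assembly.

```c
#include <stddef.h>
#include <stdint.h>

/* C model of an in-place left shift by the fixed count 1, walking
   memory in incrementing order.  With the count known at compile
   time there is no cl dependency at all; only a one-bit carry is
   passed between words.  Returns the bit shifted out of the top. */
uint64_t lshift_by_1_inplace(uint64_t *p, size_t n)
{
    uint64_t carry = 0;
    for (size_t i = 0; i < n; i++) {   /* incrementing access order */
        uint64_t w = p[i];
        p[i] = (w << 1) | carry;       /* shift in previous top bit */
        carry = w >> 63;               /* save the bit shifted out */
    }
    return carry;
}
```

Shifting left by 1 is also just doubling, so an add-with-carry chain is an equally valid implementation of the same routine.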

And for the K8 only
lshift_by_2 lshift_by_3 run at
lshift_by_4 lshift_by_5 lshift_by_6 run at
rshift_by_2 runs at

Shifts by byte multiples can also be done faster.
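The reason byte-multiple shifts are cheap is that on a little-endian machine they need no bit manipulation at all, only a byte-granular copy. A minimal sketch, with the function name lshift_bytes and its buffer convention (dst has room for nbytes + shift_bytes) being assumptions of this example:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch: a left shift by a whole number of bytes on a
   little-endian machine is a copy to a higher address plus a zero
   fill of the vacated low bytes; no shl/shr/or work is needed. */
void lshift_bytes(uint8_t *dst, const uint8_t *src,
                  size_t nbytes, size_t shift_bytes)
{
    memmove(dst + shift_bytes, src, nbytes);  /* move data up */
    memset(dst, 0, shift_bytes);              /* zero the low bytes */
}
```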