Intel shift
We can use the double-shift instructions SHLD/SHRD on the Intel chips.
These take 2 micro-ops each, giving us an optimal sequence of load, shld,
store consisting of 4 micro-ops, which takes 1c in the RAT, so we expect
1.0+ε c/w. NOTE: as jump/loop are only processed in pipe 5 (unlike AMD),
we must have a whole number of cycles in the Intel loops.
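
As an illustration of the load, shld, store sequence, here is a minimal C
sketch of a left shift that walks from the high limb down to the low one.
The names shld64 and lshift_decr_sketch are hypothetical and are not taken
from the assembly source; the sketch assumes 64-bit limbs, n >= 1 and
1 <= cnt <= 63, and it only models the arithmetic, not the timing.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical helper: the value SHLD produces for 64-bit operands.
   The top cnt bits of lo are shifted into the bottom of hi.
   Valid only for 1 <= cnt <= 63. */
static uint64_t shld64(uint64_t hi, uint64_t lo, unsigned cnt)
{
    return (hi << cnt) | (lo >> (64 - cnt));
}

/* Shift the n-limb number {up, n} left by cnt bits into {rp, n},
   working from the most significant limb downwards.
   Returns the bits shifted out of the top limb.
   Assumes n >= 1 and 1 <= cnt <= 63. */
uint64_t lshift_decr_sketch(uint64_t *rp, const uint64_t *up,
                            size_t n, unsigned cnt)
{
    uint64_t out = up[n - 1] >> (64 - cnt);

    /* One load, one double shift and one store per limb. */
    for (size_t i = n - 1; i > 0; i--)
        rp[i] = shld64(up[i], up[i - 1], cnt);

    rp[0] = up[0] << cnt;   /* lowest limb has no lower neighbour */
    return out;
}
```

Because each result limb is written only after both of its source limbs
have been read, and the walk goes downwards, the sketch also works when
rp and up are the same array.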
lshift_decr_4way takes 1.25 c/w
rshift_incr_4way takes 1.25 c/w
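
One possible reading of the 1.25 c/w figures, offered as a rough model
and not as a measurement: a 4-way loop needs 4 x 4 = 16 micro-ops for the
shifts alone, and any loop overhead pushes it past 4 cycles, so with a
whole number of cycles per iteration the next step is 5 cycles, i.e.
5/4 = 1.25 c/w. The sketch below does this arithmetic. The function name
and the overhead of 2 micro-ops are assumptions made for illustration;
the notes give no overhead count.

```c
/* Rough model only: cycles per word for an unrolled loop, assuming the
   RAT handles rat_width micro-ops per cycle and that each iteration
   takes a whole number of cycles. overhead_uops (pointer update, loop
   jump) is an assumed figure, not a measured one. */
double model_cycles_per_word(unsigned ways, unsigned uops_per_word,
                             unsigned overhead_uops, unsigned rat_width)
{
    unsigned uops = ways * uops_per_word + overhead_uops;
    unsigned cycles = (uops + rat_width - 1) / rat_width;  /* round up */
    return (double)cycles / ways;
}
```

With ways = 4, uops_per_word = 4, rat_width = 4 and an assumed overhead
of 2, the model gives 5 cycles per iteration, matching 1.25 c/w. Under
the same assumptions an 8-way loop would come out at 9/8 = 1.125 c/w.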
The wind-down code could be cleaned up somewhat
On the Nehalem the latency of shld is 3 or 4, rather than 2 as on the
Core 2, so the 4-way loop appears to run slower (not yet confirmed), and
we may need to unroll it some more.
lshift_decr_4way takes 1.7? c/w
rshift_incr_4way takes 1.7? c/w
NOTE: using SSE we should be able to improve on this, as the bound
appears to be 0.875 or 1.0 c/w.
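
To show what an SSE version could look like, here is a sketch using SSE2
intrinsics that forms two result limbs per iteration from two unaligned
loads, two vector shifts and an OR. The function name lshift_sse2_sketch
is hypothetical, the sketch assumes 64-bit limbs, n >= 1 and
1 <= cnt <= 63, and nothing here has been timed, so it makes no claim
about reaching the bound quoted above.

```c
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdint.h>
#include <stddef.h>

/* Shift {up, n} left by cnt bits into {rp, n}, high limbs first.
   Returns the bits shifted out of the top limb.
   Assumes n >= 1 and 1 <= cnt <= 63. */
uint64_t lshift_sse2_sketch(uint64_t *rp, const uint64_t *up,
                            size_t n, unsigned cnt)
{
    const __m128i lc = _mm_cvtsi32_si128((int)cnt);         /* left count  */
    const __m128i rc = _mm_cvtsi32_si128((int)(64 - cnt));  /* right count */
    uint64_t out = up[n - 1] >> (64 - cnt);
    size_t i = n;

    /* Each pass produces rp[i-2] and rp[i-1] from up[i-3..i-1]. */
    while (i >= 3) {
        /* hi = { up[i-2], up[i-1] }, lo = { up[i-3], up[i-2] } */
        __m128i hi = _mm_loadu_si128((const __m128i *)(up + i - 2));
        __m128i lo = _mm_loadu_si128((const __m128i *)(up + i - 3));
        __m128i r  = _mm_or_si128(_mm_sll_epi64(hi, lc),
                                  _mm_srl_epi64(lo, rc));
        _mm_storeu_si128((__m128i *)(rp + i - 2), r);
        i -= 2;
    }

    /* Scalar tail for the remaining one or two low limbs. */
    for (; i > 1; i--)
        rp[i - 1] = (up[i - 1] << cnt) | (up[i - 2] >> (64 - cnt));
    rp[0] = up[0] << cnt;
    return out;
}
```

The unaligned loads are used for simplicity; a tuned version would
presumably want aligned accesses, which this sketch does not attempt.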