We can use the double-shift instructions SHLD/SHRD on the Intel chips. These take 2 micro-ops each, giving us an optimal sequence of load, shld, store consisting of 4 micro-ops, which takes 1 cycle in the RAT, so we expect 1.0+epsilon c/w. NOTE: as jump/loop instructions are only processed in pipe 5 (unlike AMD), each iteration of the Intel loops must take a whole number of cycles.
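As a rough illustration of what the load, shld, store sequence computes per word, here is a minimal C sketch. It is not the assembly from this directory: the names shld64 and lshift_decr_ref are hypothetical, and the sketch assumes 64-bit words, n >= 1 and a shift count between 1 and 63.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Models a 64-bit SHLD for 1 <= cnt <= 63: hi shifted left by cnt,
   with the vacated low bits filled from the top bits of lo. */
static uint64_t shld64(uint64_t hi, uint64_t lo, unsigned cnt)
{
    return (hi << cnt) | (lo >> (64 - cnt));
}

/* Hypothetical reference routine: shift the n-word number at src left by
   cnt bits, walking from the most significant word down (the "decr"
   direction). Returns the bits shifted out of the top word. */
static uint64_t lshift_decr_ref(uint64_t *dst, const uint64_t *src,
                                size_t n, unsigned cnt)
{
    assert(n >= 1 && cnt >= 1 && cnt <= 63);  /* assumed preconditions */

    uint64_t out = src[n - 1] >> (64 - cnt);
    for (size_t i = n - 1; i > 0; i--) {
        /* one load, one double shift, one store per word */
        dst[i] = shld64(src[i], src[i - 1], cnt);
    }
    dst[0] = src[0] << cnt;  /* lowest word has no neighbour below it */
    return out;
}
```

Each loop iteration corresponds to one word of the result, so the 4 micro-ops quoted above are the cost of one pass through the loop body, not counting loop control.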

lshift_decr_4way takes 1.25 c/w

rshift_incr_4way takes 1.25 c/w

The wind-down code could be cleaned up somewhat

On Nehalem the latency of shld is 3 or 4 cycles, rather than the 2 cycles on Core 2. This is presumably why the 4-way loop runs slower there, so we may need to unroll it some more.

lshift_decr_4way takes 1.7? c/w

rshift_incr_4way takes 1.7? c/w

NOTE: using SSE we should be able to improve on this, as the bound appears to be 0.875 or 1.0 c/w.