AMD store

All the AMD chips are limited to a write bandwidth of two 64-bit words per cycle, so the best we can achieve is 0.5c/w.
As a loop takes a minimum of 2 cycles, we need to unroll at least 4-way to achieve this.
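
The arithmetic behind the 4-way figure can be sketched as below. This is only a rough model, assuming the two limits quoted above (two words stored per cycle, two cycles minimum per loop iteration) are the only constraints; the function name is made up for illustration.

```python
from fractions import Fraction

WORDS_PER_CYCLE = 2   # store bandwidth quoted above: two 64-bit words per cycle
MIN_LOOP_CYCLES = 2   # minimum cycles per loop iteration quoted above


def best_cycles_per_word(unroll):
    """Lower bound on c/w for a loop that stores `unroll` words per iteration."""
    # Cycles needed just to push the stores through the write port.
    store_cycles = Fraction(unroll, WORDS_PER_CYCLE)
    # An iteration can never be shorter than the minimum loop time.
    loop_cycles = max(Fraction(MIN_LOOP_CYCLES), store_cycles)
    return loop_cycles / unroll


for n in (1, 2, 4, 8):
    print(n, float(best_cycles_per_word(n)))
```

Under this model 1-way and 2-way unrolls are held back by the loop overhead, and 4-way is the first unroll that reaches the 0.5c/w store limit.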

With a 4-way unroll (store_4way) the best we can get is
0.75c/w on the K8 and 0.5c/w to 0.75c/w on the K10/K10-2, depending on stepping.

There appear to be a number of issues stopping us from getting to 0.5c/w.
The code is 25 bytes long, so it contains a 16-byte boundary, which costs an extra cycle (K8 only). That gives 3 cycles for 4 words, i.e. 3/4 = 0.75c/w, which matches what we measure. However, using movaps and the xmm registers brings the code size down to <= 16 bytes, yet the code then runs at 1.0c/w. I think the real reason is processor bugs, as some K10s are faster than others and the only change is the stepping.
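
The code-size point can be checked with a small sketch. It assumes only that the relevant boundary is every 16 bytes, as stated above; the helper name and the idea of sweeping all start offsets are illustrative, not taken from the real code.

```python
BLOCK = 16  # bytes; the boundary size named in the text


def crosses_boundary(start, length):
    """True if code occupying bytes [start, start + length) contains a 16-byte boundary."""
    first_block = start // BLOCK
    last_block = (start + length - 1) // BLOCK
    return first_block != last_block


# A 25-byte loop straddles a boundary whatever its alignment.
assert all(crosses_boundary(s, 25) for s in range(BLOCK))

# A 16-byte loop avoids the boundary only if it starts exactly on one.
print([s for s in range(BLOCK) if not crosses_boundary(s, 16)])
```

So no alignment can rescue the 25-byte version, while a loop of 16 bytes or less can be placed inside a single block.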

Unrolling to 8-way gets us down to 0.5625c/w on the K8 and 0.5c/w on the K10/K10-2. The 0.5625c/w figure comes from the loop taking 4 cycles, then 5 cycles, then 4 cycles, then 5 cycles (an average of 4.5 cycles per 8 words), due to a mismatch in scheduling. To correct this mismatch we need to insert two extra macro-ops, to make sure each loop iteration starts in the same state.
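
As a quick check of the 8-way numbers, the sketch below averages a repeating pattern of per-iteration cycle counts. The function name is hypothetical, and the steady 4-cycle pattern is simply what the loop would cost if the scheduling mismatch were absent.

```python
from fractions import Fraction


def average_cycles_per_word(cycle_pattern, words_per_iteration):
    """Average c/w for a loop whose iteration times repeat `cycle_pattern`."""
    total_cycles = sum(cycle_pattern)
    total_words = words_per_iteration * len(cycle_pattern)
    return Fraction(total_cycles, total_words)


# Alternating 4 and 5 cycles with 8 words stored per iteration.
print(float(average_cycles_per_word([4, 5], 8)))

# A steady 4 cycles per iteration.
print(float(average_cycles_per_word([4], 8)))
```

The alternating pattern gives 9/16 = 0.5625c/w, and a steady 4 cycles gives 0.5c/w.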

store_8way_lea: the lea version seems to be slower on some chips.
Although the speed seems to have been lost with the computed goto???