All the AMD chips are limited by a write bandwidth of two 64-bit words
per cycle, so the best we can achieve is 0.5c/w.
As a loop takes a minimum of 2 cycles, we need to unroll at least
4-way to achieve this.
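
The arithmetic behind that bound can be sketched as follows. This is only
a back-of-the-envelope model: the two constants come from the text above,
while the function names are made up for illustration.

```python
import math

# Figures taken from the notes above.
WRITE_WORDS_PER_CYCLE = 2   # two 64-bit stores per cycle
MIN_LOOP_CYCLES = 2         # a loop iteration takes at least 2 cycles


def best_cycles_per_word(write_words_per_cycle):
    """Lower bound on cycles per word set by write bandwidth alone."""
    return 1 / write_words_per_cycle


def min_unroll(min_loop_cycles, write_words_per_cycle):
    """Smallest number of words per iteration that can hit the bound.

    We need min_loop_cycles / unroll <= 1 / write_words_per_cycle.
    """
    return math.ceil(min_loop_cycles * write_words_per_cycle)


print(best_cycles_per_word(WRITE_WORDS_PER_CYCLE))          # 0.5
print(min_unroll(MIN_LOOP_CYCLES, WRITE_WORDS_PER_CYCLE))   # 4
```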
With a 4-way unroll (store_4way) the best we can get is 0.75c/w on
the K8 and 0.5c/w to 0.75c/w on the K10/K10-2, depending on
There appear to be a number of issues stopping us from getting to 0.5c/w.
The code is 25 bytes long, and because it contains a 16-byte boundary
it requires an extra cycle (K8 only), so we get 3/4=0.75c/w, which
matches what we can measure. However, by using movaps and the xmm
registers the code size can be brought down to <=16 bytes, but the
code then runs at 1.0c/w. I think the real reason is processor bugs,
as some K10s are faster than others and the only change is the
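
The 16-byte boundary argument can be checked with a small sketch. It
assumes a simplified model in which each 16-byte window the loop body
touches beyond the first costs one extra cycle. That model, the start
offsets and the function names are illustrative assumptions, not
measured behaviour.

```python
WINDOW = 16  # bytes; the boundary size named in the notes


def windows_touched(start, length, window=WINDOW):
    """Number of aligned `window`-byte blocks covered by the code bytes
    in the range [start, start + length)."""
    first = start // window
    last = (start + length - 1) // window
    return last - first + 1


def cycles_per_word(cycles_per_iteration, words_per_iteration):
    return cycles_per_iteration / words_per_iteration


# A 25-byte loop body spans at least two windows wherever it starts.
print(min(windows_touched(s, 25) for s in range(WINDOW)))   # 2

# A 16-byte body that starts on a boundary stays inside one window.
print(windows_touched(0, 16))                               # 1

# 2 cycles minimum plus 1 extra cycle, for 4 words per iteration.
print(cycles_per_word(2 + 1, 4))                            # 0.75
```

Under this model a loop body of 16 bytes or less avoids the penalty only
if it is also aligned so that it does not straddle a boundary.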
Unrolling to 8-way gets us down to 0.5625c/w on the K8 and 0.5c/w on
the K10/K10-2. The 0.5625c/w figure arises because the loop takes
4 cycles, then 5 cycles, then 4 cycles, then 5 cycles (an average of
4.5 cycles per 8 words), due to a mismatch in scheduling. To correct
this mismatch we need to insert two extra macro-ops to make sure each
loop iteration starts in the same state.
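
As a quick check of the 8-way figures, the average cost of an
alternating 4/5-cycle pattern can be worked out directly. The helper
name is made up for illustration, and the steady 4-cycle pattern is
what we would hope to get once the scheduling mismatch is removed, not
a measured result.

```python
from fractions import Fraction


def average_cycles_per_word(cycle_pattern, words_per_iteration):
    """Average c/w over a repeating pattern of per-iteration cycle counts."""
    total_cycles = sum(cycle_pattern)
    total_words = words_per_iteration * len(cycle_pattern)
    return Fraction(total_cycles, total_words)


# Iterations alternate between 4 and 5 cycles, 8 words each.
alternating = average_cycles_per_word([4, 5, 4, 5], 8)

# Every iteration starts in the same state and takes 4 cycles.
steady = average_cycles_per_word([4, 4, 4, 4], 8)

print(float(alternating))   # 0.5625
print(float(steady))        # 0.5
```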
store_8way_lea: the lea version seems to be
slower on some chips.
Although the speed seems to have been lost with the computed goto???