AMD logic
All the AMD chips are limited by the ld/st bandwidth of 2 mem ops per
cycle, so the best we can achieve is 1.5 c/w for non-SSE and 1.0 c/w for
SSE on the K10/K10-2. However, the unaligned NOT logic
variants are also limited by macro-op retirement to
1.0+ε c/w.
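
The non-SSE figure can be checked with a little arithmetic. The sketch
below assumes (this is not stated above) that each result word costs two
loads and one store, giving 3 mem ops per word against 2 mem ops per
cycle. The helper name mem_bound_cw is hypothetical, and the SSE figure
is not derived here.

```c
#include <assert.h>

/* Hypothetical helper: lower bound in cycles per word (c/w) implied by
 * load/store bandwidth alone, ignoring every other pipeline limit. */
static double mem_bound_cw(int loads_per_word, int stores_per_word,
                           int mem_ops_per_cycle)
{
    assert(mem_ops_per_cycle > 0);
    return (double)(loads_per_word + stores_per_word) / mem_ops_per_cycle;
}
```

With the assumed 2 loads and 1 store per word, mem_bound_cw(2, 1, 2)
evaluates to 1.5.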
and, or, xor all use a 4-way unroll with pointer updates,
which makes for a small wind-down.
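
As a rough illustration of that loop shape, here is a minimal C sketch of
a 4-way unrolled "and" with pointer updates. It is not the assembly
itself: the function name and_n_unroll4 and the 64-bit word type are
assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical name: rp[i] = xp[i] & yp[i] for n words. */
static void and_n_unroll4(uint64_t *rp, const uint64_t *xp,
                          const uint64_t *yp, size_t n)
{
    assert(rp != NULL && xp != NULL && yp != NULL);

    /* Main loop: four words per pass, then bump all three pointers. */
    for (; n >= 4; n -= 4) {
        rp[0] = xp[0] & yp[0];
        rp[1] = xp[1] & yp[1];
        rp[2] = xp[2] & yp[2];
        rp[3] = xp[3] & yp[3];
        rp += 4; xp += 4; yp += 4;
    }

    /* Wind-down: 0 to 3 leftover words. Because the pointers were
     * updated in the loop, they already point at the leftovers. */
    while (n-- > 0)
        *rp++ = *xp++ & *yp++;
}
```

Updating the pointers inside the loop is what keeps the wind-down small:
the tail needs no index arithmetic, only the remaining count.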
andn, orn, xorn all use a 4-way unroll (the wind-down could be
simplified).
nand, nor all use a
4-way unroll, but it is pipelined to achieve the best speed. Note the funny
feed-in code required to "prime" the ld/st units correctly.
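
The pipelined structure can be pictured with the C sketch below. It only
shows the shape: a feed-in that loads the first group of four words, a
loop that loads the next group before storing the previous results, and
a drain. The name nand_n_pipelined and the 64-bit word type are
assumptions, and a C compiler is free to reorder these operations, so
this says nothing about the actual instruction scheduling.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical name: rp[i] = ~(xp[i] & yp[i]) for n words. */
static void nand_n_pipelined(uint64_t *rp, const uint64_t *xp,
                             const uint64_t *yp, size_t n)
{
    size_t i = 0, k;
    uint64_t a[4], b[4], r[4];

    assert(rp != NULL && xp != NULL && yp != NULL);

    if (n >= 4) {
        /* Feed-in: "prime" the pipeline by loading the first group. */
        for (k = 0; k < 4; k++) { a[k] = xp[k]; b[k] = yp[k]; }

        for (i = 4; i + 4 <= n; i += 4) {
            /* Compute on the group loaded one pass earlier. */
            for (k = 0; k < 4; k++) r[k] = ~(a[k] & b[k]);
            /* Load the next group before the stores are issued. */
            for (k = 0; k < 4; k++) { a[k] = xp[i + k]; b[k] = yp[i + k]; }
            /* Store the results for the previous group. */
            for (k = 0; k < 4; k++) rp[i - 4 + k] = r[k];
        }

        /* Drain: finish the last group that was loaded. */
        for (k = 0; k < 4; k++) rp[i - 4 + k] = ~(a[k] & b[k]);
    }

    /* Wind-down: 0 to 3 leftover words, done one at a time. */
    for (; i < n; i++) rp[i] = ~(xp[i] & yp[i]);
}
```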
The SSE version for the K10 still needs to be written.