AMD logic

All the AMD chips are limited by the ld/st bandwidth of 2 mem ops per cycle. With two loads and one store per word, the best we can achieve for non-SSE is 1.5 c/w; for SSE on the K10/K10-2 it is 1.0 c/w. However, for the unaligned "not" logic variants we are also limited by macro-op retirement to 1.0+ε c/w.

and, or and xor all use a 4-way unroll with pointer updates, which makes for a small wind-down.
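As a rough illustration of the loop shape described above, here is a minimal C sketch of a 4-way unroll with pointer updates and a short wind-down. The function name and signature are hypothetical (they are not the library's entry points), and the real routines are hand-written assembly; the C only shows the structure, not the instruction scheduling.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical name and signature, for illustration only. */
void and_n_unroll4(uint64_t *rp, const uint64_t *up, const uint64_t *vp,
                   size_t n)
{
    /* Main loop: four words per pass, pointers advanced once per pass. */
    while (n >= 4) {
        rp[0] = up[0] & vp[0];
        rp[1] = up[1] & vp[1];
        rp[2] = up[2] & vp[2];
        rp[3] = up[3] & vp[3];
        rp += 4;
        up += 4;
        vp += 4;
        n -= 4;
    }

    /* Wind-down: 0 to 3 leftover words. Because the pointers already
       point at the first unprocessed word, no index fix-up is needed. */
    while (n > 0) {
        *rp++ = *up++ & *vp++;
        n--;
    }
}
```

Swapping `&` for `|` or `^` gives the or and xor forms of the same sketch.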

andn, orn and xorn all use a 4-way unroll (the wind-down could be simplified).

nand and nor both use a 4-way unroll, but it is pipelined to achieve the best speed. Note the funny feed-in code required to "prime" the ld/st units correctly.
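The idea of feed-in code that primes the loads ahead of a pipelined loop can be sketched in C as follows. This is an assumption-laden illustration, not the actual routine: the function name is hypothetical, the loop is not unrolled (the real code is 4-way), and a C compiler is free to reorder the loads and stores, so only hand-written assembly can guarantee the schedule.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical name and signature, for illustration only. */
void nand_n_pipelined(uint64_t *rp, const uint64_t *up, const uint64_t *vp,
                      size_t n)
{
    uint64_t a, b, r;
    size_t i;

    if (n == 0)
        return;

    /* Feed-in: issue the first pair of loads before the loop starts. */
    a = up[0];
    b = vp[0];

    /* Steady state: compute on the operands loaded one step earlier,
       start the next pair of loads, then store the current result. */
    for (i = 0; i + 1 < n; i++) {
        r = ~(a & b);
        a = up[i + 1];
        b = vp[i + 1];
        rp[i] = r;
    }

    /* Wind-down: the last pair is already loaded; finish and store it. */
    rp[i] = ~(a & b);
}
```

Replacing `~(a & b)` with `~(a | b)` gives the nor form. Each load of word i+1 happens before the store of word i, so the sketch still works when `rp` is the same array as `up` or `vp`.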

The SSE version for the K10 still needs to be written.