AMD copy

All the AMD chips are limited by a load/store bandwidth of 2 memory ops per cycle, and a copy needs one load and one store per word, so the best we can achieve is 1.0 c/w for non-SSE code and 0.75 c/w for SSE on the K10/K10-2

A 4-way unroll achieves the optimal non-SSE speed in both incrementing and decrementing versions
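
For illustration only, the shape of a 4-way unrolled copy can be sketched in C as below. This is not the assembly itself, and it says nothing about the timings an actual loop reaches. The function names copy_incr_unroll4 and copy_decr_unroll4 and the use of uint64_t as the word type are assumptions made for this sketch. Each main-loop iteration does four loads and four stores, and a short wind-down loop handles the 0 to 3 words left over.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names and word type, chosen for this sketch only. */

/* Incrementing version: copies from the low address upwards. */
void copy_incr_unroll4(uint64_t *dst, const uint64_t *src, size_t n)
{
    size_t i = 0;

    /* Main loop: four loads followed by four stores per iteration. */
    for (; i + 4 <= n; i += 4) {
        uint64_t a = src[i];
        uint64_t b = src[i + 1];
        uint64_t c = src[i + 2];
        uint64_t d = src[i + 3];
        dst[i]     = a;
        dst[i + 1] = b;
        dst[i + 2] = c;
        dst[i + 3] = d;
    }

    /* Wind-down: the remaining 0 to 3 words, one at a time. */
    for (; i < n; i++)
        dst[i] = src[i];
}

/* Decrementing version: copies from the high address downwards. */
void copy_decr_unroll4(uint64_t *dst, const uint64_t *src, size_t n)
{
    size_t i = n;

    /* Main loop: four loads followed by four stores per iteration. */
    for (; i >= 4; i -= 4) {
        uint64_t a = src[i - 1];
        uint64_t b = src[i - 2];
        uint64_t c = src[i - 3];
        uint64_t d = src[i - 4];
        dst[i - 1] = a;
        dst[i - 2] = b;
        dst[i - 3] = c;
        dst[i - 4] = d;
    }

    /* Wind-down: the remaining 0 to 3 words, one at a time. */
    for (; i > 0; i--)
        dst[i - 1] = src[i - 1];
}
```

With n = 11, for example, the main loop of either function runs twice (8 words) and the wind-down loop copies the last 3 words.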

A 2-way unroll may work, but as with store on the K8 we expect problems reaching the optimal speed

The wind-down code could be cleaned up somewhat

The SSE version for the K10 needs to be written