All the Intel chips are limited by a load/store bandwidth of one read and one write per cycle, so the best we can achieve is 1.0 c/w without SSE and 0.5 c/w with SSE, since each SSE load or store moves two words.

The non-SSE copyd from the K8 achieves 1.0 c/w.

But to get to 0.5 c/w we have to use SSE, with three versions (can we get away with two?):

1) when src and dst are both aligned or both unaligned

2) when src is aligned and dst is unaligned

3) when src is unaligned and dst is aligned
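
As a rough sketch of how the three cases above could be told apart, the C below classifies a src/dst pair. It assumes "aligned" means aligned to a 16-byte SSE boundary; the names copy_variant, is_aligned16 and choose_copy_variant are made up for illustration and are not taken from the assembly source.

```c
#include <stdint.h>

/* Hypothetical labels for the three versions listed above. */
enum copy_variant {
    COPY_SAME_ALIGNMENT = 1,            /* both aligned or both unaligned */
    COPY_SRC_ALIGNED_DST_UNALIGNED = 2, /* src aligned, dst unaligned */
    COPY_SRC_UNALIGNED_DST_ALIGNED = 3  /* src unaligned, dst aligned */
};

/* Assumption: "aligned" means a multiple of 16 bytes. */
int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15) == 0;
}

/* Pick a version by looking only at the two addresses. */
enum copy_variant choose_copy_variant(const void *src, const void *dst)
{
    int s = is_aligned16(src);
    int d = is_aligned16(dst);

    if (s == d)
        return COPY_SAME_ALIGNMENT;
    return s ? COPY_SRC_ALIGNED_DST_UNALIGNED
             : COPY_SRC_UNALIGNED_DST_ALIGNED;
}
```

Only the addresses are inspected; nothing is dereferenced, so the check costs a couple of ALU operations before the copy loop is entered.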

Here is the incrementing version, copyi; the decrementing version should be the same.

The wind-down code could be shortened, and we could use movaps to save a few bytes.