Core2 copy
All the Intel chips are limited by the load/store bandwidth of 1 write
and 1 read per cycle, so the best we can achieve is 1.0c/w for non-SSE
and 0.5c/w for SSE.
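The two bounds follow directly from the access width. A minimal sketch of the arithmetic, assuming c/w means cycles per 64-bit word and that one load plus one store can issue every cycle (the function name is hypothetical):

```c
#include <stdio.h>

/* Hypothetical helper: lower bound, in cycles per 64-bit word, for a copy
   that can issue one load and one store per cycle.
   access_bytes is the width of each load/store: 8 for a general-purpose
   register move, 16 for an SSE register move. */
static double copy_bound_cycles_per_word(unsigned access_bytes)
{
    const unsigned word_bytes = 8;  /* assumption: a word is 64 bits */
    return (double)word_bytes / (double)access_bytes;
}
```

With 8-byte accesses each cycle moves one word (1.0c/w); with 16-byte SSE accesses each cycle moves two words (0.5c/w).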
The non-SSE copyd from the k8 achieves 1.0c/w, but to get to 0.5c/w we
have to use SSE and use three (can we get away with 2?) versions:
1) when src and dst are both aligned or both unaligned
2) when src is aligned and dst is unaligned
3) when src is unaligned and dst is aligned
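The choice between the three versions can be sketched as a small classifier. This is only an illustration, assuming "aligned" means 16-byte aligned (the requirement for aligned SSE moves); the enum and function names are hypothetical and do not come from the original code:

```c
#include <stdint.h>

/* Hypothetical labels for the three versions listed above. */
enum copy_case {
    COPY_SAME_ALIGN    = 1,  /* src and dst both aligned or both unaligned */
    COPY_DST_UNALIGNED = 2,  /* src aligned, dst unaligned */
    COPY_SRC_UNALIGNED = 3   /* src unaligned, dst aligned */
};

/* Classify a copy by the 16-byte alignment of its two pointers.
   The pointers are only inspected, never dereferenced. */
static enum copy_case classify_copy(const void *src, const void *dst)
{
    int src_aligned = ((uintptr_t)src & 15) == 0;
    int dst_aligned = ((uintptr_t)dst & 15) == 0;

    if (src_aligned == dst_aligned)
        return COPY_SAME_ALIGN;
    return src_aligned ? COPY_DST_UNALIGNED : COPY_SRC_UNALIGNED;
}
```

In case 1 the two pointers share the same offset modulo 16, so copying at most one leading word brings both onto a 16-byte boundary; cases 2 and 3 cannot be fixed up that way, which is why they need separate handling.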
Here is the incrementing version, copyi; the decrementing version
should be the same.
The wind-down code could be shortened, and we could use movaps to save
a few bytes.
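For readers who want the overall shape in a higher-level form, here is a rough C sketch of an incrementing copy for case 1 only (src and dst with the same alignment modulo 16). It is not the assembly these notes refer to: the name copyi_sketch is hypothetical, the pointers are assumed to be at least 8-byte aligned, and SSE2 intrinsics stand in for the hand-written moves.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical sketch: copy n 64-bit words from src to dst, low addresses
   first. Precondition (assumed, not checked): src and dst are 8-byte
   aligned and have the same offset modulo 16. */
static void copyi_sketch(uint64_t *dst, const uint64_t *src, size_t n)
{
    size_t i = 0;
#if defined(__SSE2__)
    /* Lead-in: if the pointers start on an odd word, copy one word so
       that both become 16-byte aligned. */
    if (n > 0 && ((uintptr_t)src & 15) != 0) {
        dst[0] = src[0];
        i = 1;
    }
    /* Main loop: one aligned 16-byte load and one aligned 16-byte store
       per pair of words. */
    for (; i + 2 <= n; i += 2) {
        __m128i v = _mm_load_si128((const __m128i *)(src + i));
        _mm_store_si128((__m128i *)(dst + i), v);
    }
#endif
    /* Wind-down: at most one word is left after the SSE loop (all words
       are copied here when SSE2 is not available). */
    for (; i < n; i++)
        dst[i] = src[i];
}
```

A decrementing variant would walk from the high end downwards with the same lead-in, main loop, and wind-down structure.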