Nehalem copy
All the Intel chips are limited by a load/store bandwidth of 1 write and
1 read per cycle, so the best we can achieve is 1.0 c/w for non-SSE code
and 0.5 c/w for SSE code.
The penalty for unaligned loads and stores on the Nehalem chip is very
low, so we try a 4-way unroll (2-way SSE).
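
As a rough illustration of the unrolling described above, here is a
minimal sketch in C with SSE2 intrinsics. It assumes the copy operates on
double-precision elements and reads "4-way unroll (2-way SSE)" as four
elements per iteration moved by two 128-bit unaligned loads and stores.
The function name copyd_sse_sketch is hypothetical, and the actual kernel
may well be written differently (for example in assembly).

```c
#include <emmintrin.h>  /* SSE2 intrinsics: _mm_loadu_pd, _mm_storeu_pd */
#include <stddef.h>

/* Hypothetical sketch: copy n doubles from src to dst.
 * Assumes dst and src do not overlap. No alignment is required,
 * since only unaligned loads and stores are used. */
void copyd_sse_sketch(double *dst, const double *src, size_t n)
{
    size_t i = 0;

    /* Main loop: 4 doubles per iteration as two 128-bit moves. */
    for (; i + 4 <= n; i += 4) {
        __m128d lo = _mm_loadu_pd(src + i);
        __m128d hi = _mm_loadu_pd(src + i + 2);
        _mm_storeu_pd(dst + i, lo);
        _mm_storeu_pd(dst + i + 2, hi);
    }

    /* Scalar tail for the remaining 0 to 3 elements. */
    for (; i < n; ++i)
        dst[i] = src[i];
}
```

Each 128-bit move carries two doubles, so one load and one store per
cycle would correspond to the 0.5 c/w bound quoted above, if a word here
means one double.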
copyd
copyi