Nehalem copy

All the Intel chips are limited by a load/store bandwidth of 1 write and 1 read per cycle, so the best we can achieve is 1.0 c/w (cycles per word) for non-SSE code and 0.5 c/w for SSE code, since a single SSE load or store moves two words at once.

The penalty for unaligned loads and stores on the Nehalem chip is very low, so we try a 4-way unroll (2-way SSE).
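
The unrolling described above can be sketched in C with SSE2 intrinsics. This is only an illustration, not the routine these notes describe: the name copy_words_sse2 is hypothetical, a word is assumed to be 64 bits, and the source and destination are assumed not to overlap. Each iteration moves 4 words with two unaligned 128-bit loads and two unaligned 128-bit stores, and a scalar loop handles the leftover words.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper for illustration only: copy n 64-bit words from
   src to dst. Assumes the two regions do not overlap. */
void copy_words_sse2(uint64_t *dst, const uint64_t *src, size_t n)
{
    size_t i = 0;

    /* 4-way unroll in words = two 128-bit SSE moves per iteration.
       Unaligned loads/stores are used, relying on their low penalty. */
    for (; i + 4 <= n; i += 4) {
        __m128i a = _mm_loadu_si128((const __m128i *)(src + i));
        __m128i b = _mm_loadu_si128((const __m128i *)(src + i + 2));
        _mm_storeu_si128((__m128i *)(dst + i), a);
        _mm_storeu_si128((__m128i *)(dst + i + 2), b);
    }

    /* Scalar tail for the remaining 0..3 words. */
    for (; i < n; i++)
        dst[i] = src[i];
}
```

Whether a compiler turns this into code that reaches 0.5 c/w is not guaranteed; the sketch only shows the shape of the loop.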