Nehalem/Westmere store
Core2/Penryn/Nehalem/Westmere chips are limited by a store bandwidth of
one 64-bit word per cycle, or one 128-bit word per cycle with SSE, so
the best we can achieve is 0.5 c/w (cycles per 64-bit word) with SSE or
1.0 c/w without SSE.
Unlike on the Core2, on Nehalem (and Westmere?) a 1-way SSE unroll
(2-way word) does NOT achieve this speed, but a 2-way SSE unroll (4-way
word) does. This is because of the loopback buffer, which introduces a
1-cycle delay. However, if that delay is paid once per iteration, the
2-way unroll would take 3 cycles per 4 words, so we would expect
3/4 = 0.75 c/w to be the best speed?
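
For reference, the two unroll variants discussed above might look roughly
like the following C sketch using SSE2 intrinsics. This is only an
illustration, not the kernel that was actually measured: the function
names store_sse_1way and store_sse_2way are hypothetical, a "word" is
assumed to be a 64-bit double, and the destination is assumed to be
16-byte aligned. A compiler may also re-unroll these loops, so the
generated assembly should be checked before timing.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2: __m128d, _mm_set1_pd, _mm_store_pd */
#include <stddef.h>
#include <stdint.h>

/* 1-way SSE unroll (2-way word): one 128-bit store per iteration.
   Assumption: dst is 16-byte aligned and n is a multiple of 2. */
void store_sse_1way(double *dst, size_t n, double value)
{
    assert(((uintptr_t)dst & 15) == 0 && n % 2 == 0);
    const __m128d v = _mm_set1_pd(value);
    for (size_t i = 0; i < n; i += 2)
        _mm_store_pd(dst + i, v);
}

/* 2-way SSE unroll (4-way word): two 128-bit stores per iteration.
   Assumption: dst is 16-byte aligned and n is a multiple of 4. */
void store_sse_2way(double *dst, size_t n, double value)
{
    assert(((uintptr_t)dst & 15) == 0 && n % 4 == 0);
    const __m128d v = _mm_set1_pd(value);
    for (size_t i = 0; i < n; i += 4) {
        _mm_store_pd(dst + i, v);      /* words i, i+1   */
        _mm_store_pd(dst + i + 2, v);  /* words i+2, i+3 */
    }
}
```

Both functions write the same data; they differ only in how many 128-bit
stores each loop iteration issues, which is the property the timings
above depend on.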
store