Nehalem/Westmere store

Core2/Penryn/Nehalem/Westmere chips are limited by their write bandwidth of one 64-bit word per cycle, or one 128-bit word per cycle with SSE, so the best we can achieve is 0.5 cycles per 64-bit word (c/w) with SSE or 1.0c/w without SSE.
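The arithmetic behind these two bounds can be sketched in a few lines of C. This is only an illustration: the function name ideal_cycles_per_word is hypothetical, and it assumes that exactly one store issues per cycle and that nothing else limits the loop.

```c
/* Ideal store-bound cost in cycles per 64-bit word (c/w).
   Assumption: exactly one store of store_bits bits issues per cycle
   and no other resource limits the loop. */
double ideal_cycles_per_word(int store_bits)
{
    const int word_bits = 64;   /* a "word" is 64 bits in these notes */
    return (double)word_bits / (double)store_bits;
}
```

With these assumptions, ideal_cycles_per_word(64) gives 1.0 and ideal_cycles_per_word(128) gives 0.5, matching the non-SSE and SSE limits above.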

Unlike on the Core2, on Nehalem (and Westmere?) a 1-way SSE unroll (2 words per iteration) does NOT achieve this speed, but a 2-way SSE unroll (4 words per iteration) does. This is because of the loopback buffer, which introduces a 1-cycle delay. However, would we not then expect 3/4 = 0.75c/w (two store cycles plus one delay cycle for four words) to be the best speed?
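For reference, the two unroll variants might look roughly like the following in C with SSE2 intrinsics. This is a sketch under assumptions, not the loops that were actually measured: the function names fill_sse_1way and fill_sse_2way are hypothetical, the loops simply fill a buffer with a constant, dst is assumed to be 16-byte aligned, and nwords is assumed to be a multiple of 4.

```c
#include <emmintrin.h>  /* SSE2: _mm_set1_epi64x, _mm_store_si128 */
#include <stddef.h>
#include <stdint.h>

/* 1-way SSE unroll: one 128-bit store (two 64-bit words) per iteration.
   Assumes dst is 16-byte aligned and nwords is a multiple of 2. */
void fill_sse_1way(uint64_t *dst, size_t nwords, uint64_t value)
{
    const __m128i v = _mm_set1_epi64x((long long)value);
    for (size_t i = 0; i < nwords; i += 2)
        _mm_store_si128((__m128i *)(dst + i), v);
}

/* 2-way SSE unroll: two 128-bit stores (four 64-bit words) per iteration.
   Assumes dst is 16-byte aligned and nwords is a multiple of 4. */
void fill_sse_2way(uint64_t *dst, size_t nwords, uint64_t value)
{
    const __m128i v = _mm_set1_epi64x((long long)value);
    for (size_t i = 0; i < nwords; i += 4) {
        _mm_store_si128((__m128i *)(dst + i), v);
        _mm_store_si128((__m128i *)(dst + i + 2), v);
    }
}
```

An optimizing compiler may unroll these loops further or replace them entirely, so the generated assembly should be checked before timing either variant.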