Core2/Penryn store

Core2/Penryn/Nehalem/Westmere chips are limited by a write bandwidth of one 64-bit word per cycle, or one 128-bit word (two 64-bit words) per cycle with SSE, so the best we can achieve is 0.5 c/w with SSE or 1.0 c/w without SSE.

On the Core2/Penryn, a 1-way SSE (2-way word) unroll achieves this speed.
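As a rough illustration of what a 1-way SSE (2-way word) unroll looks like, here is a minimal sketch in C using SSE2 intrinsics. It is not the actual assembly routine: the name store_words_sse2 is hypothetical, the destination is assumed to be at least 8-byte aligned, and whether a compiler's output really reaches 0.5 c/w depends on the generated code and the chip.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: fill dst[0..n-1] with val.
   Assumption: dst is at least 8-byte aligned. */
static void store_words_sse2(uint64_t *dst, size_t n, uint64_t val)
{
    assert(dst != NULL || n == 0);
    size_t i = 0;

    /* Peel one word if needed so the vector stores are 16-byte aligned. */
    if (n > 0 && ((uintptr_t)dst & 15) != 0) {
        dst[i++] = val;
    }

    /* Both 64-bit lanes hold the same value. */
    __m128i v = _mm_set1_epi64x((long long)val);

    /* 1-way SSE unroll: one 128-bit store (two words) per iteration. */
    for (; i + 2 <= n; i += 2) {
        _mm_store_si128((__m128i *)(dst + i), v);
    }

    /* At most one trailing word remains. */
    if (i < n) {
        dst[i] = val;
    }
}
```

Each loop iteration issues a single 128-bit store covering two words, which is where the 0.5 c/w bound comes from if the store port is kept busy every cycle.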
store