Core2/Penryn store
Core2/Penryn/Nehalem/Westmere chips are limited by a write bandwidth of
one 64-bit word per cycle, or one 128-bit word per cycle with SSE, so the
best we can achieve is 0.5c/w with SSE or 1.0c/w without SSE.
On the Core2/Penryn, a 1-way SSE (2-way word) unroll already achieves the
0.5c/w bound.
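
To make the "1-way SSE (2-way word)" loop shape concrete, here is a minimal C sketch using SSE2 intrinsics. It is not the actual assembly routine: the name store_words, its signature, and the alignment handling are assumptions made for illustration. Each loop iteration issues one 128-bit store, which covers two 64-bit words, so a loop limited to one store per cycle works out to 0.5c/w.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from the original notes):
   fill dst[0..n-1] with val using one 128-bit store per iteration.
   Assumes dst has the natural 8-byte alignment of uint64_t. */
static void store_words(uint64_t *dst, size_t n, uint64_t val)
{
    size_t i = 0;

    /* Peel one word so the 128-bit stores below are 16-byte aligned. */
    if (n > 0 && ((uintptr_t)dst & 15) != 0)
        dst[i++] = val;

    /* Both 64-bit lanes of v hold val. */
    __m128i v = _mm_set1_epi64x((long long)val);

    /* 1-way SSE unroll: one aligned 128-bit store = two words. */
    for (; i + 2 <= n; i += 2)
        _mm_store_si128((__m128i *)(dst + i), v);

    /* Odd word left over at the end, if any. */
    if (i < n)
        dst[i] = val;
}
```

Whether a compiler turns this into a loop that actually sustains one store per cycle depends on the loop overhead it generates; the sketch only shows the structure of the unroll.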
store