Nehalem/Westmere mul_1
The latency of the add+adc carry chain is 3c, so 3 cycles per word (3c/w) is the best that can be achieved. Because of the loopback latency, at least a 3-way unroll is needed to reach this; however, the best so far is 3.333c/w.
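The 3c/w bound comes from the loop-carried dependency. Each limb's multiply is independent, but the carry produced for limb i feeds the add+adc for limb i+1. The sketch below is a plain C reference for mul_1, not the tuned assembly. It assumes 64-bit limbs, the GCC/Clang `unsigned __int128` extension, and GMP-style semantics (result in `rp`, carry limb returned). The names `limb_t` and `mul_1_ref` are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t limb_t; /* hypothetical name: one 64-bit limb */

/* Reference mul_1 (hypothetical name): rp[0..n-1] = low limbs of up[0..n-1] * v,
   returns the final high (carry) limb. Assumes unsigned __int128 (GCC/Clang). */
limb_t mul_1_ref(limb_t *rp, const limb_t *up, size_t n, limb_t v)
{
    limb_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        /* mul: independent of previous iterations, so it can overlap */
        unsigned __int128 p = (unsigned __int128)up[i] * v;
        limb_t lo = (limb_t)p;
        limb_t hi = (limb_t)(p >> 64);

        lo += carry;        /* "add": low half plus incoming carry */
        hi += (lo < carry); /* "adc": carry-out goes into the high half;
                               hi <= 2^64 - 2, so this cannot overflow */
        rp[i] = lo;
        carry = hi;         /* loop-carried: next limb's add depends on this */
    }
    return carry;
}
```

Only the `carry -> add -> adc -> carry` path is serial. That is why unrolling can hide the multiplies and loop overhead but cannot go below the add+adc latency per word.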
mul_1 runs at 3.333c/w.