Core2 mul_1
The Core2/Penryn mul instruction has a throughput of one mul every 4c, so 4c/w is optimal for non-SSE code, and we can get this with a 1-way unroll.
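
For reference, a minimal C sketch of what mul_1 computes, to show why the loop is bound by mul throughput: each word of the result needs exactly one widening multiply. The name ref_mul_1, its signature, and the 64-bit word size are assumptions modelled on the usual mpn convention, not taken from these notes, and unsigned __int128 is a GCC/Clang extension. This is not the tuned assembly.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference version of mul_1 (name and signature assumed):
   rp[0..n-1] = up[0..n-1] * v, returning the carry-out word. */
static uint64_t ref_mul_1(uint64_t *rp, const uint64_t *up, size_t n, uint64_t v)
{
    uint64_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        /* One 64x64 -> 128-bit multiply per word. The sum cannot overflow
           128 bits: (2^64-1)^2 + (2^64-1) < 2^128. */
        unsigned __int128 p = (unsigned __int128)up[i] * v + carry;
        rp[i] = (uint64_t)p;          /* low half is the result word */
        carry = (uint64_t)(p >> 64);  /* high half carries into the next word */
    }
    return carry;
}
```

With one multiply per word and one mul issued every 4c, the loop cannot go below 4c/w however far it is unrolled, which is why a 1-way unroll is enough once the remaining adds and moves fit inside those 4 cycles.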
mul_1