Core2 mul

Core2 mul_1

The Core2/Penryn mul instruction has a throughput of one mul every 4 cycles, so 4 cycles per word (4c/w) is optimal for non-SSE code, and we can get this with a 1-way unroll, i.e. a loop that handles one word per iteration.
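
For reference, here is a minimal C sketch of what mul_1 computes, to show why the mul throughput sets the bound: each word of the result needs exactly one 64x64->128 multiply. The name ref_mul_1, the argument order and the use of unsigned __int128 (a GCC/Clang extension) are assumptions made for illustration; this is not the tuned assembly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference version of mul_1 (illustration only):
   rp[0..n-1] = up[0..n-1] * v, returning the carry-out word. */
uint64_t ref_mul_1(uint64_t *rp, const uint64_t *up, size_t n, uint64_t v)
{
    assert(n > 0);
    uint64_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        /* One full-width multiply per word; the sum cannot overflow 128 bits
           because (2^64-1)^2 + (2^64-1) < 2^128. */
        unsigned __int128 p = (unsigned __int128)up[i] * v + carry;
        rp[i] = (uint64_t)p;          /* low word goes to the result */
        carry = (uint64_t)(p >> 64);  /* high word carries into the next word */
    }
    return carry;
}
```

With one mul per word and one mul issued every 4 cycles, no non-SSE loop of this shape can beat 4c/w; the adds, loads and stores only have to fit in the shadow of the mul.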

mul_1