Core2 mul_2

The Core2/Penryn mul instruction has a throughput of 1 mul every 4c. mul_2 needs two muls per word, so 8c/w is optimal for non-SSE code, and we can get this with a 3-way unroll. A 2-way unroll achieves 8.1c/w, and if we disallow src==dst we can get a 2-way unroll to run at 8c/w. This suggests that with a bit of tweaking we can get the general 2-way unroll to also run at 8c/w.
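
To make the 8c/w bound concrete, here is a minimal C sketch of what a mul_2 computes: an n-word source times a 2-word multiplier. It is not the tuned assembly. The name ref_mul_2, its signature, and the choice to write all n+2 result words are assumptions made for illustration, and it uses 32-bit words so that it stays in portable C, whereas the real routine works on 64-bit words with the hardware mul. The point is that the loop body contains exactly two multiplies per source word, so at 1 mul every 4c the bound is 2 x 4c = 8c/w.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference routine (not the assembly): computes
 * {rp, n+2} = {up, n} * {vp, 2} using 32-bit words, least significant first. */
static void ref_mul_2(uint32_t *rp, const uint32_t *up, size_t n,
                      const uint32_t vp[2])
{
    uint32_t c0 = 0;  /* pending carry at weight B^i     */
    uint32_t c1 = 0;  /* pending carry at weight B^(i+1) */

    for (size_t i = 0; i < n; i++) {
        uint64_t u = up[i];

        /* First multiply for this word: u * v0, plus the low carry. */
        uint64_t p0 = u * vp[0] + c0;
        rp[i] = (uint32_t)p0;

        /* Second multiply for this word: u * v1, plus both carries.
         * The sum cannot overflow 64 bits:
         * (B-1)^2 + 2(B-1) = B^2 - 1 with B = 2^32. */
        uint64_t p1 = u * vp[1] + c1 + (p0 >> 32);
        c0 = (uint32_t)p1;
        c1 = (uint32_t)(p1 >> 32);
    }

    /* Flush the two carry words that are still pending. */
    rp[n] = c0;
    rp[n + 1] = c1;
}
```

Because up[i] is read before rp[i] is written, this sketch happens to work with rp == up. It says nothing about why the src==dst case costs the 2-way assembly loop the extra 0.1c/w.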

mul_2 3-way runs at 8.0c/w

mul_2 2-way runs at 8.1c/w