Core2 mul_2
The Core2/Penryn mul instruction has a throughput of one mul every 4
cycles. mul_2 needs two muls per limb, so 8c/w is optimal for non-SSE
code, and we can get this with a 3-way unroll. A 2-way unroll achieves
8.1c/w, and if we disallow src==dst we can get a 2-way unroll to run at
8c/w. This suggests that with a bit of tweaking we can get the general
2-way unroll to also run at 8c/w.
mul_2 3-way runs at 8.0c/w
mul_2 2-way runs at 8.1c/w
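
The 8c/w bound above comes from the two multiplies each source limb
needs (2 muls x 4c = 8c). A minimal portable C sketch of what mul_2
computes is given below to make that count visible. It is not the tuned
assembly: the name ref_mul_2, the limb_t type and the 32-bit limb size
are assumptions made only for illustration, and the calling convention
(n+1 limbs written, top limb returned) is assumed rather than taken
from these notes.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative only: 32-bit limbs so that a 64-bit product fits in
   standard C. The real code works on 64-bit limbs in assembly. */
typedef uint32_t limb_t;

/* Hypothetical reference: {dst, n+1} = {src, n} * {v, 2}.
   Returns the most significant (n+2'th) limb of the product.
   src[i] is read before dst[i] is written, so src == dst is allowed
   provided dst has room for n+1 limbs. */
limb_t ref_mul_2(limb_t *dst, const limb_t *src, size_t n, const limb_t *v)
{
    limb_t c0 = 0, c1 = 0;   /* pending limbs at weight i and i+1 */

    for (size_t i = 0; i < n; i++) {
        /* The two muls per limb that set the 8c/w lower bound. */
        uint64_t p0 = (uint64_t)src[i] * v[0];
        uint64_t p1 = (uint64_t)src[i] * v[1];

        /* Weight i: pending low limb plus low half of p0. */
        uint64_t t = (uint64_t)c0 + (limb_t)p0;
        dst[i] = (limb_t)t;

        /* Weight i+1: pending high limb, high half of p0,
           low half of p1, and the carry out of weight i. */
        uint64_t u = (uint64_t)c1 + (p0 >> 32) + (limb_t)p1 + (t >> 32);
        c0 = (limb_t)u;

        /* Weight i+2: high half of p1 plus the carry out of weight i+1.
           This cannot overflow because the partial product of i+1 limbs
           by 2 limbs always fits in i+3 limbs. */
        c1 = (limb_t)(p1 >> 32) + (limb_t)(u >> 32);
    }

    dst[n] = c0;
    return c1;
}
```

Everything in the loop other than the two multiplies is additions and
carry handling, which is why the mul throughput is treated as the
limiting factor in the non-SSE case.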