Core2 inc/decmul_1
The Core2/Penryn mul instruction has a throughput of 1 mul every 4c, so
4c/w is optimal for non-SSE code. Our current best code is a 4-way
unroll:
incmul_1 4-way unroll runs at 4.7c/w
incmul_1_pipeline 4-way unroll runs at 4.0c/w. However, for <20 words
this is slower than the code above, so its use in mul basecase is
limited, but we can use it for the linear part of Toom-3, Toom-4, etc.
multiplication.
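
For reference, a minimal C sketch of the operation being timed, assuming
incmul_1 means the usual "multiply a vector by one word and add it into
the destination" primitive, with 64-bit words. The name ref_incmul_1 and
its signature are hypothetical, and unsigned __int128 is a GCC/Clang
extension. This only illustrates the shape of a 4-way unroll (one mul per
word, carry passed from word to word); it is not the assembly code
measured above and will not reach those cycle counts.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference routine (not the tuned assembly):
 * rp[0..n-1] += up[0..n-1] * v, returning the carry-out word.
 * Assumes 64-bit words and a compiler providing unsigned __int128. */
static uint64_t ref_incmul_1(uint64_t *rp, const uint64_t *up,
                             size_t n, uint64_t v)
{
    uint64_t carry = 0;
    size_t i = 0;

/* One word: a single 64x64->128 multiply plus two additions.
 * The sum cannot overflow 128 bits:
 * (2^64-1)^2 + 2*(2^64-1) = 2^128 - 1. */
#define STEP(k)                                                    \
    do {                                                           \
        unsigned __int128 t =                                      \
            (unsigned __int128)up[i + (k)] * v + rp[i + (k)] + carry; \
        rp[i + (k)] = (uint64_t)t;                                 \
        carry = (uint64_t)(t >> 64);                               \
    } while (0)

    /* Main loop: four words per iteration (the 4-way unroll). */
    for (; i + 4 <= n; i += 4) {
        STEP(0);
        STEP(1);
        STEP(2);
        STEP(3);
    }
    /* Tail: the remaining 0 to 3 words. */
    for (; i < n; i++)
        STEP(0);

#undef STEP
    return carry;
}
```

With one mul per word, the 4c mul throughput quoted above is what bounds
any loop of this shape at 4c/w.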