Core2 inc/decmul_1

The Core2/Penryn mul instruction has a throughput of one mul every 4 cycles, so 4c/w (cycles per word) is optimal for non-SSE code. Our current best code is a 4-way unroll.

incmul_1 4-way unroll runs at 4.7c/w
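
For reference, the shape of a 4-way unrolled loop can be sketched in C. This is only an illustration, not the assembly these notes measure. It assumes incmul_1 means the addmul_1-style operation rp[i] += up[i]*v with the final carry word returned; the names limb_t, step and incmul_1_sketch are made up for this sketch, and unsigned __int128 is a GCC/Clang extension. Each step needs one 64x64 multiply, which is where the 4c/w bound comes from.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t limb_t;             /* one 64-bit word ("w") */
typedef unsigned __int128 dlimb_t;   /* GCC/Clang extension, assumed available */

/* One word: *r += u * v + carry, returning the new carry word.
   u*v + *r + carry is at most 2^128 - 1, so the sum cannot overflow. */
static inline limb_t step(limb_t *r, limb_t u, limb_t v, limb_t carry)
{
    dlimb_t t = (dlimb_t)u * v + *r + carry;
    *r = (limb_t)t;
    return (limb_t)(t >> 64);
}

/* Hypothetical C model of a 4-way unrolled incmul_1 (assumed semantics:
   rp[0..n-1] += up[0..n-1] * v, carry-out returned). */
limb_t incmul_1_sketch(limb_t *rp, const limb_t *up, size_t n, limb_t v)
{
    assert(n >= 1);
    limb_t carry = 0;
    size_t i = 0;

    /* Main loop: four words per iteration, one multiply per word. */
    for (; i + 4 <= n; i += 4) {
        carry = step(&rp[i],     up[i],     v, carry);
        carry = step(&rp[i + 1], up[i + 1], v, carry);
        carry = step(&rp[i + 2], up[i + 2], v, carry);
        carry = step(&rp[i + 3], up[i + 3], v, carry);
    }

    /* Tail: the remaining 0..3 words. */
    for (; i < n; i++)
        carry = step(&rp[i], up[i], v, carry);

    return carry;
}
```

The C version says nothing about cycle counts; it only shows the per-word work and the unroll/tail structure.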

incmul_1_pipeline 4-way unroll runs at 4.0c/w. However, for <20 words it is slower than the code above, so its use in mul basecase is limited, but we can use it for the linear part of Toom-3, Toom-4, etc. multiplication.