Core2/Penryn diag

The schoolbook version of the square function can require a diag function, which computes the diagonal products (the multiplications that are done only once).
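
As a rough illustration of what such a diag function computes, here is a minimal C sketch. It assumes 64-bit words and the GCC/Clang `unsigned __int128` extension. The name `sqr_diag` and the output layout (low word of a[i]^2 in r[2i], high word in r[2i+1]) are assumptions made for this example, not the interface of the assembly routine timed below.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference version of a diag function.
 * For each word a[i], store the two-word square a[i]^2 in
 * r[2*i] (low word) and r[2*i+1] (high word).
 * r must have room for 2*n words.
 * Assumes unsigned __int128 is available (GCC/Clang extension). */
void sqr_diag(uint64_t *r, const uint64_t *a, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        /* One 64x64 -> 128 bit multiply per input word. */
        unsigned __int128 p = (unsigned __int128)a[i] * a[i];
        r[2 * i]     = (uint64_t)p;          /* low word  */
        r[2 * i + 1] = (uint64_t)(p >> 64);  /* high word */
    }
}
```

In a schoolbook square the off-diagonal products a[i]*a[j] with i != j each occur twice, so they are typically computed once and doubled, while the products a[i]*a[i] above occur only once. The loop does exactly one multiply per input word, so its speed is bounded by the multiplier throughput.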

On the Core2/Penryn the mul throughput limit is 4 cycles per word (c/w), and we can achieve it.

diag_1way runs at 4.0c/w