Core2/Penryn diag
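The schoolbook version of the square function can require a diag function, which computes the diagonal products a[i]*a[i] (the multiplications that are done only once, unlike the off-diagonal products a[i]*a[j], which are doubled).
On the Core2/Penryn the mul throughput limit is 4c/w (cycles per word), and we can achieve it.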
diag_1way
runs at 4.0c/w
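
For reference, the diagonal step can be sketched in portable C. This is only an illustration of what a diag function computes, not the tuned assembly measured above. The name diag_ref is hypothetical, 64-bit words are assumed, and the unsigned __int128 type is a GCC/Clang extension.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference version of the diagonal step (not the tuned asm).
 * Writes the squares a[i]*a[i] into r, which must hold 2*n words.
 * Each square fills exactly two words, so the products never overlap
 * and no carry propagation is needed. */
static void diag_ref(uint64_t *r, const uint64_t *a, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        /* full 64x64 -> 128 bit product; __int128 is a compiler extension */
        unsigned __int128 sq = (unsigned __int128)a[i] * a[i];
        r[2 * i]     = (uint64_t)sq;          /* low word of a[i]^2  */
        r[2 * i + 1] = (uint64_t)(sq >> 64);  /* high word of a[i]^2 */
    }
}
```

There is one mul per input word, which is why the mul throughput bounds the speed of this loop.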