AMD diag
The schoolbook version of the square function can require a diag
function, which computes the multiplications that are done only once
(the diagonal terms a[i]*a[i]).
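
As a rough illustration of what such a diag pass computes, here is a minimal
portable sketch in C. The name sqr_diag_ref, the 64-bit word size and the
output layout (the low word of a[i]^2 in r[2i], the high word in r[2i+1]) are
assumptions made for this example, not taken from the assembly routines
discussed here. It also relies on the unsigned __int128 extension of GCC and
Clang.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference for the diagonal pass of a schoolbook square.
 * For each input word a[i] it stores the 128-bit product a[i]*a[i] as
 * two 64-bit words: low half in r[2*i], high half in r[2*i + 1].
 * r must have room for 2*n words and must not overlap a.
 * Assumes a compiler providing unsigned __int128 (GCC, Clang). */
void sqr_diag_ref(uint64_t *r, const uint64_t *a, size_t n)
{
    assert(n > 0);
    for (size_t i = 0; i < n; i++) {
        unsigned __int128 p = (unsigned __int128)a[i] * a[i];
        r[2 * i]     = (uint64_t)p;          /* low word of a[i]^2  */
        r[2 * i + 1] = (uint64_t)(p >> 64);  /* high word of a[i]^2 */
    }
}
```

Each word needs exactly one mul, which is why the mul throughput sets the
floor for this function.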
On the AMD we have the mul limit of 2 c/w. However, as we require 2
macro-ops for the loop control, at least one of the muls in a loop requires
3 cycles due to the restrictions of the pick hardware, so we expect to
get timings of 2+ε c/w, which we do:
diag_4way, which runs at 2.25 c/w
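
The 2.25 figure is consistent with a simple cost model, sketched below in C.
The model is an assumption made for this example: it reads c/w as cycles per
word and supposes that in a loop unrolled k ways, k-1 muls cost 2 cycles each
and one mul costs 3 cycles. The function name diag_cycles_per_word is
hypothetical.

```c
/* Assumed cost model: in a k-way unrolled loop, (k - 1) muls run at
 * the 2-cycle limit and one mul takes 3 cycles because of the loop
 * control macro-ops. Returns the estimated cycles per word. */
double diag_cycles_per_word(unsigned k)
{
    double cycles_per_iteration = 2.0 * (k - 1) + 3.0;
    return cycles_per_iteration / k;
}
```

Under this model a 4-way loop costs 3*2 + 3 = 9 cycles for 4 words, i.e.
2.25 c/w, and the extra cost over 2 c/w is 1/k, so it shrinks as the loop is
unrolled further.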