AMD diag

The schoolbook version of the square function can require a diag function, which computes the diagonal products (the multiplications that are done only once).
On the AMD the mul limit is 2 c/w. However, the loop control requires 2 macro-ops, so at least one of the muls in the loop requires 3 cycles due to the restrictions of the pick hardware. We therefore expect timings of 2+ε c/w, which is what we get.
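
For reference, the diagonal step described above can be sketched in portable C. This is only an illustrative model, not the tuned assembly: the name diag_ref, the 32-bit limb size (chosen so the double-width product fits in a standard uint64_t) and the output layout (square of limb i stored in result limbs 2i and 2i+1) are all assumptions made for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference for a "diag" step (name and layout are assumptions):
   store the square of each of the n limbs of a[] into the 2n limbs of r[],
   so that r[2i+1]:r[2i] = a[i]^2.  Each product is computed exactly once. */
static void diag_ref(uint32_t *r, const uint32_t *a, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        uint64_t sq = (uint64_t)a[i] * a[i];   /* one mul per input limb */
        r[2 * i]     = (uint32_t)sq;           /* low half of the square */
        r[2 * i + 1] = (uint32_t)(sq >> 32);   /* high half of the square */
    }
}
```

In a full schoolbook square the doubled cross products a[i]*a[j] (i != j) would then be added on top of this result; that part is not shown here.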

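diag_4way runs at 2.25 c/w. This is consistent with a 4-way unrolled loop in which three muls take 2 cycles and one takes 3: (3*2 + 3)/4 = 2.25.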