Nehalem/Westmere diag

The throughput for mul is 2c/w, and we also need 2c for the 2 stores required per word, but because of the 1 cycle added in the loopback buffer, the best we can get is 2.0+ε c/w

diag_4way runs at 2.25c/w
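
One possible reading of the two figures above, sketched under an assumption the note does not state: if the 1 extra loopback-buffer cycle is paid once per loop iteration, an n-way unrolled loop costs 2n+1 cycles for n words. The function name and parameters below are hypothetical, chosen only for illustration.

```python
def cycles_per_word(unroll, base_cpw=2.0, loop_overhead_cycles=1.0):
    """Estimated cycles per word for a loop unrolled `unroll` ways.

    Assumption (not from the note): the loop overhead is paid once per
    iteration, and each iteration processes `unroll` words at base_cpw.
    """
    return (base_cpw * unroll + loop_overhead_cycles) / unroll


if __name__ == "__main__":
    # Show how the overhead term shrinks as the unroll factor grows.
    for n in (1, 2, 4, 8, 16):
        print(f"{n}-way: {cycles_per_word(n):.4f} c/w")
```

Under this model a 4-way unroll gives (2*4 + 1)/4 = 2.25 c/w, which matches the diag_4way figure, and deeper unrolling only approaches 2.0 c/w without reaching it.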