AMD com not

Both com and not are bound by ld/st to 1 c/w, which we can achieve for not. For com we are also bound by macro-op retirement, so the best we can get is 1+ε c/w.

com runs at 1.25 c/w with a 4-way unroll
not runs at 1.0 c/w with a 2-way unroll
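
For reference, a minimal C sketch of the two operations and the unroll factors quoted above. It assumes com writes the one's complement of a source vector to a separate destination and not complements a vector in place. The names limb_t, com_n and not_n are made up for this illustration. It only shows the loop shape; the c/w figures above come from hand-written assembly, not from code like this.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t limb_t;  /* hypothetical name for one machine word */

/* com (assumed semantics): dst[i] = ~src[i].
   One load and one store per word, 4-way unrolled. */
static void com_n(limb_t *dst, const limb_t *src, size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        dst[i]     = ~src[i];
        dst[i + 1] = ~src[i + 1];
        dst[i + 2] = ~src[i + 2];
        dst[i + 3] = ~src[i + 3];
    }
    for (; i < n; i++)       /* leftover words */
        dst[i] = ~src[i];
}

/* not (assumed semantics): p[i] = ~p[i] in place, 2-way unrolled. */
static void not_n(limb_t *p, size_t n)
{
    size_t i = 0;
    for (; i + 2 <= n; i += 2) {
        p[i]     = ~p[i];
        p[i + 1] = ~p[i + 1];
    }
    for (; i < n; i++)       /* leftover word */
        p[i] = ~p[i];
}
```

Applying not_n to the output of com_n gives back the original source words, which is a quick sanity check for either routine.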

On the K10 we can use SSE to speed things up; the ld/st bound there is 0.75 c/w.
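
A rough sketch of the SSE idea, written with SSE2 intrinsics in C instead of assembly: each 128-bit load/store moves two 64-bit words, and the complement is an XOR with all ones. The function name com_sse2 is made up for this illustration, and unaligned loads and stores are used only to keep it short. This is an assumption about what an SSE version would look like, not the actual K10 code.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: dst[i] = ~src[i], two 64-bit words per SSE2 op. */
static void com_sse2(uint64_t *dst, const uint64_t *src, size_t n)
{
    const __m128i ones = _mm_set1_epi32(-1);  /* all 128 bits set */
    size_t i = 0;
    for (; i + 2 <= n; i += 2) {
        __m128i v = _mm_loadu_si128((const __m128i *)(src + i));
        _mm_storeu_si128((__m128i *)(dst + i), _mm_xor_si128(v, ones));
    }
    if (i < n)               /* odd word count: finish with a scalar op */
        dst[i] = ~src[i];
}
```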