The Code Cavern


Welcome to the Code Cavern, an underground repository of highly optimized assembler code.
(mining for optimal sequences)


LOG

To take full advantage of the CPUs we typically have to write a different version for each micro-architecture. We consider the x64 micro-architectures listed below.

| Micro-architecture | Family | Model         | Comments |
|--------------------|--------|---------------|----------|
| K8                 | 15     | any           | There are a number of subdivisions, mainly to do with L1 data cache access; I consider the latest one. |
| K10                | 16     | 2             | Like the latest K8, but SSE is faster and there are a few new instructions |
| K10-2              | 16     | 4 5 6 8 9 10  | 45nm, appears to be a tweaked K10 with improved load-store |
| K10-3              | 18     | any           | 32nm K10-2 plus a hardware divider, aka Llano |
| K8+                | 17     | any           | Fusion of K8 and GPU |
| Bobcat             | 20     | any           | Fusion of Bobcat and GPU |
| Bulldozer          | 21     | any           |          |
| Core2              | 6      | 15 22         | 65nm |
| Penryn             | 6      | 23 29         | 45nm, like Core2 but div and shuffle are faster |
| Nehalem            | 6      | 26 30 31 46   | 45nm |
| Westmere           | 6      | 37 44 47      | 32nm, like Nehalem but with new instructions |
| SandyBridge        | 6      | 42            | 32nm with AVX |
| Atom               | 6      | 28            | Only one model |
| Nano               | 6      | 15            | No specific code yet |
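
The Family and Model numbers in the table are decimal values of the kind that can be derived from the EAX result of CPUID leaf 1. As a minimal sketch (the function name decode_signature is mine, not part of the Cavern code), the decoding below follows the usual rule: the extended family is added only when the base family is 15, and the extended model supplies the high nibble for base families 6 and 15. How EAX is obtained is left out, and the vendor string would still be needed to tell rows such as Core2 and Nano apart.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: decode the display family and model from the
   EAX value returned by CPUID leaf 1. */
void decode_signature(uint32_t eax, unsigned *family, unsigned *model)
{
    assert(family != 0 && model != 0);

    unsigned base_family = (eax >> 8) & 0xF;
    unsigned base_model  = (eax >> 4) & 0xF;
    unsigned ext_family  = (eax >> 20) & 0xFF;
    unsigned ext_model   = (eax >> 16) & 0xF;

    /* The extended family is only added when the base family is 15. */
    *family = (base_family == 15) ? base_family + ext_family : base_family;

    /* The extended model is the high nibble for base families 6 and 15. */
    *model = (base_family == 6 || base_family == 15)
                 ? ((ext_model << 4) | base_model)
                 : base_model;
}
```

For example, an EAX value of 0x000206A7 decodes to family 6, model 42 (the SandyBridge row), and 0x00100F42 decodes to family 16, model 4 (a K10-2 row).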

A good reference is Agner Fog's Software optimization resources.


We assume that all the data is in the L1 data cache and that we operate on 64-bit words aligned on their natural 64-bit boundary.



| Data format | Name                                     | Description                      | AKA                          |
|-------------|------------------------------------------|----------------------------------|------------------------------|
| BDA         | binary digit array                       | array of "words"                 | GMP's mpn                    |
| IBDA        | interleaved binary digit array           | interleaving of 2 or more BDAs   |                              |
| TBDA        | truncated binary digit array             | array of truncated "words"       | GMP's mpn with nails         |
| ITBDA       | interleaved truncated binary digit array | interleaving of 2 or more TBDAs  |                              |
| DA          | digit array                              | array of digits                  | FLINT's small polynomials?   |
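
To make the formats a little more concrete, here is a rough C sketch of how a BDA might be turned into an IBDA or a TBDA. Several things here are my assumptions rather than definitions from this page: words are stored least significant first (as in GMP's mpn), interleaving is done word by word, a truncated word keeps its payload in the low bits with the top "nail" bits zero, and the function names interleave2 and bda_to_tbda are purely illustrative.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* IBDA sketch (assumed layout): interleave two n-word BDAs word by word,
   giving a0 b0 a1 b1 ...; dst must hold 2*n words. */
void interleave2(uint64_t *dst, const uint64_t *a, const uint64_t *b, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        dst[2 * i]     = a[i];
        dst[2 * i + 1] = b[i];
    }
}

/* TBDA sketch (assumed layout): repack an n-word BDA into words that carry
   only (64 - nails) payload bits each, with the top `nails` bits zero.
   dst must hold ceil(64*n / (64 - nails)) words; returns the word count. */
size_t bda_to_tbda(uint64_t *dst, const uint64_t *src, size_t n, unsigned nails)
{
    assert(nails >= 1 && nails <= 63);
    const unsigned bits = 64 - nails;               /* payload bits per word */
    const uint64_t mask = (UINT64_C(1) << bits) - 1;
    const size_t total = 64 * n;                    /* bits in the source */
    size_t out = 0;

    for (size_t pos = 0; pos < total; pos += bits, out++) {
        size_t w = pos / 64;
        unsigned off = (unsigned)(pos % 64);
        uint64_t v = src[w] >> off;
        /* Pull in bits from the next source word when the field straddles. */
        if (off != 0 && w + 1 < n)
            v |= src[w + 1] << (64 - off);
        dst[out] = v & mask;
    }
    return out;
}
```

With 4 nail bits, the two-word BDA {0xFFFFFFFFFFFFFFFF, 1} becomes the three-word TBDA {0x0FFFFFFFFFFFFFFF, 0x1F, 0}, which represents the same integer in base 2^60.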


Function definitions  def.h
Some notes: On relative alignment
Cycles per word Table

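| function     | K8          | K10/K10-2  | Bobcat | Core2/Penryn | Nehalem/Westmere | Sandybridge | Atom |
|--------------|-------------|------------|--------|--------------|------------------|-------------|------|
| store        | 0.5625      | 0.5        |        | 0.5          | 0.5              |             |      |
| copy         | 1.0         |            |        | 0.5 or 1.0   | 0.7?             |             |      |
| com not      | 1.25 1.0    |            |        |              |                  |             |      |
| logic        | 1.5         |            |        | 2.0          | 1.3              |             |      |
| popham       | 4.666 5.0   | 1.0 1.5    |        | 2.75_5.9     | 1.0 2.0          |             |      |
| shift        | 2.166 1.0   | 1.666 1.0  |        | 1.25         | 1.7?             |             |      |
| add/sub      | 1.5         |            |        | 2.0          |                  |             |      |
| mul_1        | 2.5 2.428   |            |        | 4.0          | 3.333            |             |      |
| mul_2        | 4.666 4.5   |            |        | 8.0          |                  |             |      |
| inc/decmul_1 | 2.5 2.428   |            |        | 4.7 4.0      |                  |             |      |
| diag         | 2.25        |            |        | 4.0          | 2.25             |             |      |
| adddiag      | 2.75        |            |        |              |                  |             |      |
| mod_1_1      | 7.0         |            |        | 13.0 13.3    | 12.0             |             |      |
| mod_1_2      | 3.5         |            |        | 6.2          | 6.0              |             |      |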

Note: Some of this code is released under the LGPL
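
For readers who want a reference point for what the assembler versions compute, here is a plain C sketch of an n-word addition of the kind the add/sub row of the cycles-per-word table refers to. The real prototypes live in def.h, which is not reproduced here, so the name add_n, the argument order, and the least-significant-word-first layout are assumptions modelled on GMP's mpn conventions. A figure of 1.5 cycles per word would then mean roughly 1.5*n cycles for n-word operands.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reference sketch (assumed interface): r = a + b over n 64-bit words,
   least significant word first. Returns the carry out (0 or 1). */
uint64_t add_n(uint64_t *r, const uint64_t *a, const uint64_t *b, size_t n)
{
    assert(n == 0 || (r != 0 && a != 0 && b != 0));

    uint64_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t s  = a[i] + carry;
        uint64_t c1 = s < carry;     /* wrapped while adding the carry in */
        uint64_t t  = s + b[i];
        uint64_t c2 = t < s;         /* wrapped while adding b[i] */
        r[i] = t;
        carry = c1 | c2;             /* at most one of the two can be set */
    }
    return carry;
}
```

The assembler versions would keep the carry in the flags with an adc chain instead of recomputing it with comparisons; the C loop only pins down the result.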


Cycles for mul

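| CPU              | RAX | RDX | Throughput | Micro-ops/Macro-ops |
|------------------|-----|-----|------------|---------------------|
| K8 K10 K10-2     | 4   | 5   | 2          | 2                   |
| Bobcat           | 6   | 7   | 5          | 2                   |
| Bulldozer        |     |     | 6?         |                     |
| Core2 Penryn     | 7   | 8   | 4          | 3                   |
| Nehalem Westmere | 3   | 10  | 2          | 3                   |
| Sandybridge      | 3   | 4   | 1          | 2                   |
| Atom             |     |     |            |                     |
| Nano             |     |     |            |                     |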
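
The 64-bit mul instruction leaves the low half of the 128-bit product in RAX and the high half in RDX, which is presumably why the table lists a separate figure for each register. As a hedged illustration of that split (the function name mul_64x64 is mine), the portable C sketch below builds the same two halves from 32-bit pieces.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative sketch: unsigned 64x64 -> 128-bit multiply from 32-bit halves.
   *lo is the half that mul leaves in RAX, *hi the half it leaves in RDX. */
void mul_64x64(uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo)
{
    assert(hi != 0 && lo != 0);

    uint64_t a0 = (uint32_t)a, a1 = a >> 32;
    uint64_t b0 = (uint32_t)b, b1 = b >> 32;

    uint64_t p00 = a0 * b0;
    uint64_t p01 = a0 * b1;
    uint64_t p10 = a1 * b0;
    uint64_t p11 = a1 * b1;

    /* Middle column: three values below 2^32 each, so no 64-bit overflow. */
    uint64_t mid = (p00 >> 32) + (uint32_t)p01 + (uint32_t)p10;

    *lo = (mid << 32) | (uint32_t)p00;
    *hi = p11 + (p01 >> 32) + (p10 >> 32) + (mid >> 32);
}
```

Code that only needs the low half can continue as soon as RAX is ready, which is one reason the two figures are worth tracking separately when scheduling a mul_1 style loop.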