Welcome to the Code Cavern, an underground repository of highly optimized
assembler code.
(mining for optimal sequences)
To take full advantage of the CPUs we typically have to write a different
version for each micro-architecture. We consider the ones below, all as x64 targets.
Micro-architecture | Family | Model | Comments |
K8 | 15 | any | There are a number of subdivisions, mainly to do with L1 data cache access; I consider the latest one. |
K10 | 16 | 2 | Like the latest K8, but SSE is faster and there are a few new instructions |
K10-2 | 16 | 4 5 6 8 9 10 | 45nm, appears to be a tweaked K10 with improved load-store |
K10-3 | 18 | any | 32nm K10-2 plus a hardware divider, aka Llano |
K8+ | 17 | any | fusion of K8 and GPU |
Bobcat | 20 | any | fusion of Bobcat and GPU |
Bulldozer | 21 | any | |
Core2 | 6 | 15 22 | 65nm |
Penryn | 6 | 23 29 | 45nm, like Core2 but div and shuffle are faster |
Nehalem | 6 | 26 30 31 46 | 45nm |
Westmere | 6 | 37 44 47 | 32nm, like Nehalem but with new instructions |
SandyBridge | 6 | 42 | 32nm with AVX |
Atom | 6 | 28 | Only one model |
Nano | 6 | 15 | No specific code yet |
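Picking the right code version at run time comes down to a lookup in the table above. The following is a minimal sketch of that lookup, not code from this repository. It assumes the Family and Model columns hold the "display" values derived from CPUID leaf 1 (base plus extended fields), and it adds a vendor string to each row, which the table does not list, so that Core2 and Nano (both family 6, model 15) can be told apart. The names `decode_signature`, `classify` and `TABLE` are hypothetical.

```python
def decode_signature(eax):
    """Split a CPUID leaf 1 EAX value into display (family, model)."""
    base_family = (eax >> 8) & 0xF
    base_model = (eax >> 4) & 0xF
    ext_model = (eax >> 16) & 0xF
    ext_family = (eax >> 20) & 0xFF
    # The extended family is only added when the base family is 15.
    family = base_family + ext_family if base_family == 0xF else base_family
    # The extended model is only used for base families 6 and 15.
    if base_family in (0x6, 0xF):
        model = (ext_model << 4) | base_model
    else:
        model = base_model
    return family, model


# (vendor, family, models or None for "any", micro-architecture).
# The vendor strings are an assumption added for this sketch.
TABLE = [
    ("AuthenticAMD", 15, None, "K8"),
    ("AuthenticAMD", 16, {2}, "K10"),
    ("AuthenticAMD", 16, {4, 5, 6, 8, 9, 10}, "K10-2"),
    ("AuthenticAMD", 18, None, "K10-3"),
    ("AuthenticAMD", 17, None, "K8+"),
    ("AuthenticAMD", 20, None, "Bobcat"),
    ("AuthenticAMD", 21, None, "Bulldozer"),
    ("GenuineIntel", 6, {15, 22}, "Core2"),
    ("GenuineIntel", 6, {23, 29}, "Penryn"),
    ("GenuineIntel", 6, {26, 30, 31, 46}, "Nehalem"),
    ("GenuineIntel", 6, {37, 44, 47}, "Westmere"),
    ("GenuineIntel", 6, {42}, "SandyBridge"),
    ("GenuineIntel", 6, {28}, "Atom"),
    ("CentaurHauls", 6, {15}, "Nano"),
]


def classify(vendor, eax):
    """Return the micro-architecture name for a vendor string and EAX value, or None."""
    family, model = decode_signature(eax)
    for row_vendor, row_family, models, name in TABLE:
        if vendor == row_vendor and family == row_family:
            if models is None or model in models:
                return name
    return None
```

For example, `classify("GenuineIntel", 0x000206A7)` decodes to family 6, model 42 and selects the SandyBridge row. Obtaining the vendor string and EAX value (via the CPUID instruction) is left out of the sketch.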
A good source of reference is
Agner Fog's Software
optimization resources
We assume that all the data is in the L1 data cache and that we operate on
64-bit words aligned on their natural 64-bit boundary.
Data format | Name | Description | AKA |
BDA | binary digit array | array of "words" | GMP's mpn |
IBDA | interleaved binary digit array | interleaving of 2 or more BDAs | |
TBDA | truncated binary digit array | array of truncated "words" | GMP's mpn with nails |
ITBDA | interleaved truncated binary digit array | interleaving of 2 or more TBDAs | |
DA | digit array | array of digits | FLINT's small polynomials? |
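The formats in the table can be illustrated with a small reference model. This is a sketch under stated assumptions rather than the repository's definitions: words are stored least significant first (as in GMP's mpn), a truncated word keeps its payload in the low bits with `nail_bits` unused high bits, and interleaving is done word by word. All function names here are hypothetical.

```python
WORD_BITS = 64
WORD_MASK = (1 << WORD_BITS) - 1


def to_bda(value, nwords):
    """Split a non-negative integer into nwords 64-bit words, least significant first."""
    return [(value >> (WORD_BITS * i)) & WORD_MASK for i in range(nwords)]


def from_bda(words):
    """Rebuild the integer represented by a BDA."""
    return sum(w << (WORD_BITS * i) for i, w in enumerate(words))


def to_tbda(value, nwords, nail_bits):
    """Like to_bda, but each word carries only 64 - nail_bits payload bits."""
    bits = WORD_BITS - nail_bits
    mask = (1 << bits) - 1
    return [(value >> (bits * i)) & mask for i in range(nwords)]


def from_tbda(words, nail_bits):
    """Rebuild the integer represented by a TBDA."""
    bits = WORD_BITS - nail_bits
    return sum(w << (bits * i) for i, w in enumerate(words))


def interleave(*arrays):
    """Interleave equal-length arrays word by word: a0, b0, a1, b1, ..."""
    if len({len(a) for a in arrays}) > 1:
        raise ValueError("all arrays must have the same length")
    return [w for group in zip(*arrays) for w in group]


def deinterleave(words, count):
    """Undo interleave() for `count` arrays."""
    return [words[i::count] for i in range(count)]
```

With these definitions, `to_bda(2**64 + 5, 2)` gives `[5, 1]`, and `interleave([1, 2], [3, 4])` gives `[1, 3, 2, 4]`, which is the IBDA of the two inputs under the word-by-word assumption.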
Function definitions def.h
Some notes: On relative alignment
Cycles per word Table
function | K8 | K10/K10-2 | Bobcat | Core2/Penryn | Nehalem/Westmere | Sandybridge | Atom |
store | 0.5625 | 0.5 | 0.5 | 0.5 | | | |
copy | 1.0 | 0.5 or 1.0 | 0.7? | | | | |
com not | 1.25 1.0 | | | | | | |
logic | 1.5 | 2.0 | 1.3 | | | | |
popham | 4.666 5.0 | 1.0 1.5 | 2.75_5.9 | 1.0 2.0 | | | |
shift | 2.166 1.0 | 1.666 1.0 | 1.25 | 1.7? | | | |
add/sub | 1.5 | 2.0 | | | | | |
mul_1 | 2.5 2.428 | 4.0 | 3.333 | | | | |
mul_2 | 4.666 4.5 | 8.0 | | | | | |
inc/decmul_1 | 2.5 2.428 | 4.7 4.0 | | | | | |
diag | 2.25 | 4.0 | 2.25 | | | | |
adddiag | 2.75 | | | | | | |
mod_1_1 | 7.0 | 13.0 13.3 | 12.0 | | | | |
mod_1_2 | 3.5 | 6.2 | 6.0 |
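A cycles-per-word figure turns into a running-time estimate by multiplying by the operand length. The sketch below shows that arithmetic for a few entries; it is illustrative only. It assumes the first figure in a row belongs to the K8 column, ignores call and loop start-up overhead, and relies on the stated assumption that all data sits in the L1 data cache. The names `CYCLES_PER_WORD`, `estimate_cycles` and `estimate_seconds` are hypothetical.

```python
# A few figures copied from the table above (first figure of each row,
# assumed here to be the K8 column).
CYCLES_PER_WORD = {
    ("add/sub", "K8"): 1.5,
    ("mul_1", "K8"): 2.5,
    ("mod_1_1", "K8"): 7.0,
}


def estimate_cycles(function, cpu, nwords):
    """Asymptotic cycle count for nwords 64-bit words, ignoring fixed overhead."""
    return CYCLES_PER_WORD[(function, cpu)] * nwords


def estimate_seconds(function, cpu, nwords, clock_hz):
    """Convert the cycle estimate to seconds at the given core clock."""
    return estimate_cycles(function, cpu, nwords) / clock_hz
```

Under these assumptions, an add/sub over 1000 words on K8 costs about `estimate_cycles("add/sub", "K8", 1000)`, that is 1500 cycles. For short operands the fixed overhead that the sketch ignores will dominate.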
Note: Some of this code is released under the LGPL
Cycles for mul
CPU | RAX | RDX | Throughput | Micro-ops Macro-ops |
K8 K10 K10-2 | 4 | 5 | 2 | 2 |
Bobcat | 6 | 7 | 5 | 2 |
Bulldozer | 6? | | | |
Core2 Penryn | 7 | 8 | 4 | 3 |
Nehalem Westmere | 3 | 10 | 2 | 3 |
Sandybridge | 3 | 4 | 1 | 2 |
Atom | | | | |
Nano |