Benchmarks
The benchmark results were produced by the scripts in examples/benchmarks, e.g.:
examples/benchmarks/generate_configuration.sh lennard_jones
examples/benchmarks/run_benchmark.sh lennard_jones
The Tesla GPUs had ECC enabled; no overclocking or other tweaking was done.
Simple Lennard-Jones fluid in 3 dimensions
Parameters:
64,000 particles, number density \(\rho = 0.4\sigma^{-3}\)
force: lennard_jones (\(r_c = 3\sigma, r_\text{skin} = 0.7\sigma\))
integrator: verlet (NVE, \(\delta t^* = 0.002\))
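The particle number and number density fix the size of the simulation box. The following is a minimal sketch of that arithmetic in reduced Lennard-Jones units, assuming a cubic box (the box shape is not stated above); the function name `box_edge_length` is hypothetical and not part of the benchmark scripts.

```python
def box_edge_length(num_particles: int, number_density: float, dimension: int = 3) -> float:
    """Edge length L of a cubic box satisfying N = rho * L^d (assumption: cubic box)."""
    volume = num_particles / number_density
    return volume ** (1.0 / dimension)


# Parameters of the Lennard-Jones benchmark, lengths in units of sigma.
edge = box_edge_length(64_000, 0.4)
cutoff = 3.0   # r_c
skin = 0.7     # r_skin

print(f"box edge length: {edge:.2f} sigma")
print(f"box edge in units of (r_c + r_skin): {edge / (cutoff + skin):.1f}")
```

The second number indicates how many interaction ranges fit along one box edge, which gives a rough feel for how local the pair interactions are at this system size.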
| Hardware | time per MD step and particle | steps per second | FP precision | compilation details |
|---|---|---|---|---|
| Intel Xeon E5-2637v4 | 750 ns | 20.9 | double | GCC 8.3.0, -O3 |
| NVIDIA H100 | 3.4 ns | 4546 | double-single | CUDA 11.2, -arch compute_80 |
| NVIDIA H100 | 3.3 ns | 4666 | single | CUDA 11.2, -arch compute_80 |
| NVIDIA A40 | 3.5 ns | 4436 | double-single | CUDA 11.5, -arch compute_80 |
| NVIDIA A40 | 2.9 ns | 5412 | single | CUDA 11.5, -arch compute_80 |
| NVIDIA A100 | 4.6 ns | 3371 | double-single | CUDA 11.5, -arch compute_80 |
| NVIDIA A100 | 3.9 ns | 4028 | single | CUDA 11.5, -arch compute_80 |
| NVIDIA Tesla V100-PCI | 5.6 ns | 2790 | double-single | CUDA 9.2, -arch compute_61 |
| NVIDIA Tesla V100-PCI | 5.1 ns | 3020 | single | CUDA 9.2, -arch compute_61 |
| NVIDIA GeForce RTX 2080 S | 5.9 ns | 2640 | double-single | CUDA 9.2, -arch compute_61 |
| NVIDIA GeForce RTX 2080 S | 5.6 ns | 2810 | single | CUDA 9.2, -arch compute_61 |
| NVIDIA GeForce RTX 2070 | 7.7 ns | 2030 | double-single | CUDA 9.2, -arch compute_61 |
| NVIDIA GeForce RTX 2070 | 7.0 ns | 2220 | single | CUDA 9.2, -arch compute_61 |
Results were obtained from 1 independent measurement and are based on release version 1.0.0. Each run consisted of NVT equilibration at \(T^*=1.2\) over \(\Delta t^*=100\) (10⁴ steps), followed by benchmarking 10⁴ NVE steps 5 times in a row.
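The two timing columns of the table carry the same information. As a minimal sketch, assuming that "steps per second" is simply the inverse of the wall time per step (time per MD step and particle multiplied by the particle number), the conversion reads as follows; the function names are hypothetical.

```python
def steps_per_second(time_per_step_and_particle_ns: float, num_particles: int) -> float:
    """MD steps per second from the time per step and particle (in ns)."""
    wall_time_per_step_s = time_per_step_and_particle_ns * 1e-9 * num_particles
    return 1.0 / wall_time_per_step_s


def time_per_step_and_particle_ns(steps_per_sec: float, num_particles: int) -> float:
    """Inverse conversion: time per step and particle (in ns) from steps per second."""
    return 1e9 / (steps_per_sec * num_particles)


NUM_PARTICLES = 64_000  # Lennard-Jones benchmark

# CPU row: 750 ns per step and particle
print(f"{steps_per_second(750.0, NUM_PARTICLES):.1f} steps/s")

# H100 row (double-single): 4546 steps per second
print(f"{time_per_step_and_particle_ns(4546.0, NUM_PARTICLES):.2f} ns")
```

The tabulated values agree with this relation up to rounding, since the time per step and particle is quoted with two significant digits only.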
Supercooled binary mixture (Kob-Andersen)
Parameters:
256,000 particles, number density \(\rho = 1.2\sigma^{-3}\)
force: lennard_jones with 2 particle species (80% \(A\), 20% \(B\))
(\(\epsilon_{AA}=1\), \(\epsilon_{AB}=1.5\), \(\epsilon_{BB}=0.5\), \(\sigma_{AA}=1\), \(\sigma_{AB}=0.8\), \(\sigma_{BB}=0.88\), \(r_c = 2.5\sigma\), \(r_\text{skin} = 0.3\sigma\), neighbour list occupancy: 70%)
integrator: verlet (NVE, \(\delta t^* = 0.001\))
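To make the species-dependent parameters concrete, here is a minimal sketch that tabulates them and evaluates the plain Lennard-Jones pair potential \(U(r) = 4\epsilon[(\sigma/r)^{12} - (\sigma/r)^6]\) at its minimum for each pair of species. The names `KA_PARAMS` and `lj_potential` are hypothetical, and the truncation at \(r_c\) is deliberately left out because its exact form is not specified above.

```python
# (epsilon, sigma) for each pair of species, in units of epsilon_AA and sigma_AA
KA_PARAMS = {
    ("A", "A"): (1.0, 1.0),
    ("A", "B"): (1.5, 0.8),
    ("B", "B"): (0.5, 0.88),
}


def lj_potential(r: float, epsilon: float, sigma: float) -> float:
    """Untruncated Lennard-Jones pair potential."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)


# Composition: 80% A and 20% B out of 256,000 particles
num_particles = 256_000
num_a = round(0.8 * num_particles)
num_b = num_particles - num_a
print(f"A particles: {num_a}, B particles: {num_b}")

# The potential minimum sits at r = 2^(1/6) sigma and has depth -epsilon.
for pair, (epsilon, sigma) in KA_PARAMS.items():
    r_min = 2.0 ** (1.0 / 6.0) * sigma
    print(pair, f"r_min = {r_min:.3f}", f"U(r_min) = {lj_potential(r_min, epsilon, sigma):.3f}")
```

The strong and short-ranged A-B attraction relative to A-A and B-B is what distinguishes this mixture from the one-component fluid of the previous benchmark.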
| Hardware | time per MD step and particle | steps per second | FP precision | compilation details |
|---|---|---|---|---|
| Intel Xeon E5-2637v4 | 744 ns | 5.25 | double | GCC 10.2.1, -O3 |
| NVIDIA H100 | 1.79 ns | 2178 | double-single | CUDA 11.2, -arch compute_80 |
| NVIDIA H100 | 1.74 ns | 2238 | single | CUDA 11.2, -arch compute_80 |
| NVIDIA A40 | 2.56 ns | 1528 | double-single | CUDA 11.5, -arch compute_80 |
| NVIDIA A40 | 2.26 ns | 1728 | single | CUDA 11.5, -arch compute_80 |
| NVIDIA A100 | 3.10 ns | 1260 | double-single | CUDA 11.5, -arch compute_80 |
| NVIDIA A100 | 2.91 ns | 1343 | single | CUDA 11.5, -arch compute_80 |
| NVIDIA Tesla V100-PCI | 3.83 ns | 1020 | double-single | CUDA 9.2, -arch compute_61 |
| NVIDIA Tesla V100-PCI | 3.65 ns | 1070 | single | CUDA 9.2, -arch compute_61 |
| NVIDIA GeForce RTX 2080 S | 5.17 ns | 755 | double-single | CUDA 9.2, -arch compute_61 |
| NVIDIA GeForce RTX 2080 S | 4.87 ns | 802 | single | CUDA 9.2, -arch compute_61 |
| NVIDIA GeForce RTX 2070 | 6.63 ns | 589 | double-single | CUDA 9.2, -arch compute_61 |
| NVIDIA GeForce RTX 2070 | 6.28 ns | 621 | single | CUDA 9.2, -arch compute_61 |
Results were obtained from 1 independent measurement and are based on release version 1.0.0. Each run consisted of NVT equilibration at \(T^*=0.7\) over \(\Delta t^*=100\) (2×10⁴ steps), followed by benchmarking 10⁴ NVE steps 5 times in a row.
Variant “tiny”
This benchmark tests an alternative implementation of the pair force calculation that uses loop unrolling. It is particularly suited to systems with a small number of particles.
Parameters:
4,096 particles; all other parameters are as above
neighbour lists are constructed directly, without binning to Verlet cells
re-ordering of particle data in memory is disabled
double-single floating point precision is enabled
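Building neighbour lists "directly" means testing all pairs of particles instead of first sorting them into cells. The following pure-Python sketch illustrates the idea for a cubic periodic box using the minimum image convention; it is not the GPU implementation used in the benchmark, and the function name, the cubic box, and the three demo positions are assumptions made for illustration. In a Verlet list scheme the list radius would typically be \(r_c + r_\text{skin}\).

```python
import itertools


def build_neighbour_lists(positions, box_length, radius):
    """All-pairs (O(N^2)) neighbour lists in a cubic periodic box."""
    neighbours = [[] for _ in positions]
    radius_sq = radius * radius
    for i, j in itertools.combinations(range(len(positions)), 2):
        dist_sq = 0.0
        for a, b in zip(positions[i], positions[j]):
            d = a - b
            d -= box_length * round(d / box_length)  # minimum image convention
            dist_sq += d * d
        if dist_sq < radius_sq:
            neighbours[i].append(j)
            neighbours[j].append(i)
    return neighbours


# Hypothetical demo: particles 0 and 1 are neighbours across the periodic boundary.
demo_positions = [(0.5, 0.0, 0.0), (9.8, 0.0, 0.0), (5.0, 5.0, 5.0)]
demo_lists = build_neighbour_lists(demo_positions, box_length=10.0, radius=1.0)
print(demo_lists)
```

The quadratic cost of the pair loop is affordable only for small particle numbers, which is consistent with this variant being aimed at small systems.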
| Hardware | time per MD step and particle | steps per second | unroll force loop | compilation details |
|---|---|---|---|---|
| NVIDIA H100 | 23.6 ns | 10319 | yes | CUDA 11.2, -arch compute_80 |
| NVIDIA H100 | 43.7 ns | 5578 | no | CUDA 11.2, -arch compute_80 |
| NVIDIA A40 | 17.9 ns | 13638 | yes | CUDA 11.5, -arch compute_80 |
| NVIDIA A40 | 30.5 ns | 8013 | no | CUDA 11.5, -arch compute_80 |
| NVIDIA A100 | 22.6 ns | 10823 | yes | CUDA 11.5, -arch compute_80 |
| NVIDIA A100 | 38.8 ns | 6285 | no | CUDA 11.5, -arch compute_80 |
Results were obtained from 1 independent measurement and are based on the pre-release version 1.0.0-67-g24afb4c68. Each run consisted of NVT equilibration at \(T^*=0.7\) over \(\Delta t^*=100\) (2×10⁴ steps), followed by benchmarking 10⁴ NVE steps 5 times in a row.
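The gain from unrolling the force loop can be read off the table as the ratio of the two timings per GPU. A minimal sketch of that arithmetic, using only the values tabulated above (the dictionary name `TIMINGS_NS` is hypothetical):

```python
# time per MD step and particle in ns: (force loop unrolled, not unrolled)
TIMINGS_NS = {
    "NVIDIA H100": (23.6, 43.7),
    "NVIDIA A40": (17.9, 30.5),
    "NVIDIA A100": (22.6, 38.8),
}

# Speed-up factor = time without unrolling / time with unrolling
speedups = {gpu: plain / unrolled for gpu, (unrolled, plain) in TIMINGS_NS.items()}

for gpu, factor in speedups.items():
    print(f"{gpu}: {factor:.2f}x faster with unrolled force loop")
```

Since each configuration was measured only once, these ratios should be read as indicative rather than as statistically robust speed-up factors.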