Benchmarks
The benchmark results were produced by the scripts in examples/benchmarks, e.g.:
examples/benchmarks/generate_configuration.sh lennard_jones
examples/benchmarks/run_benchmark.sh lennard_jones
The Tesla GPUs had ECC enabled; no overclocking or other tweaking was done.
Simple Lennard-Jones fluid in 3 dimensions
Parameters:
- 64,000 particles, number density
- force: lennard_jones ()
- integrator: verlet (NVE, )
Hardware | time per MD step and particle | steps per second | FP precision | compilation details |
---|---|---|---|---|
Intel Xeon E5-2637v4 | 750 ns | 20.9 | double | GCC 8.3.0, -O3 |
NVIDIA Tesla V100-PCI | 5.6 ns | 2790 | double-single | CUDA 9.2, -arch compute_61 |
NVIDIA Tesla V100-PCI | 5.1 ns | 3020 | single | CUDA 9.2, -arch compute_61 |
NVIDIA GeForce RTX 2080 S | 5.9 ns | 2640 | double-single | CUDA 9.2, -arch compute_61 |
NVIDIA GeForce RTX 2080 S | 5.6 ns | 2810 | single | CUDA 9.2, -arch compute_61 |
NVIDIA GeForce RTX 2070 | 7.7 ns | 2030 | double-single | CUDA 9.2, -arch compute_61 |
NVIDIA GeForce RTX 2070 | 7.0 ns | 2220 | single | CUDA 9.2, -arch compute_61 |
NVIDIA A40 | 3.5 ns | 4436 | double-single | CUDA 11.5, -arch compute_80 |
NVIDIA A40 | 2.9 ns | 5412 | single | CUDA 11.5, -arch compute_80 |
NVIDIA A100 | 4.6 ns | 3371 | double-single | CUDA 11.5, -arch compute_80 |
NVIDIA A100 | 3.9 ns | 4028 | single | CUDA 11.5, -arch compute_80 |
Results were obtained from 1 independent measurement and are based on release version 1.0.0. Each run consisted of NVT equilibration at over (10⁴ steps), followed by benchmarking 10⁴ NVE steps 5 times in a row.
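The two timing columns are related through the particle number: the wall time of one MD step is the time per step and particle multiplied by the number of particles, and the step rate is its inverse. The following minimal sketch checks this for a few rows of the table above. The function name `steps_per_second` is chosen here for illustration and is not part of the benchmark scripts; small deviations from the reported values are expected because the table entries are rounded.

```python
def steps_per_second(time_per_step_and_particle_ns, n_particles):
    """Convert the time per MD step and particle (in ns) into MD steps per second."""
    seconds_per_step = time_per_step_and_particle_ns * 1e-9 * n_particles
    return 1.0 / seconds_per_step


N_PARTICLES = 64_000  # system size of the Lennard-Jones benchmark

# (time per MD step and particle in ns, reported steps per second), copied from the table
rows = {
    "Intel Xeon E5-2637v4 (double)": (750, 20.9),
    "NVIDIA Tesla V100-PCI (double-single)": (5.6, 2790),
    "NVIDIA A40 (single)": (2.9, 5412),
}

for name, (t_ns, reported) in rows.items():
    estimate = steps_per_second(t_ns, N_PARTICLES)
    deviation = abs(estimate - reported) / reported
    print(f"{name}: estimated {estimate:.1f} steps/s, reported {reported} ({deviation:.1%} apart)")
```

The same conversion applies to the other benchmarks on this page once `N_PARTICLES` is set to the respective system size.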
Supercooled binary mixture (Kob-Andersen)
Parameters:
- 256,000 particles, number density
- force: lennard_jones with 2 particle species (80% A, 20% B)
  (, , , , , , , , neighbour list occupancy: 70%)
- integrator: verlet (NVE, )
Hardware | time per MD step and particle | steps per second | FP precision | compilation details |
---|---|---|---|---|
Intel Xeon E5-2637v4 | 744 ns | 5.25 | double | GCC 10.2.1, -O3 |
NVIDIA Tesla V100-PCI | 3.83 ns | 1020 | double-single | CUDA 9.2, -arch compute_61 |
NVIDIA Tesla V100-PCI | 3.65 ns | 1070 | single | CUDA 9.2, -arch compute_61 |
NVIDIA GeForce RTX 2080 S | 5.17 ns | 755 | double-single | CUDA 9.2, -arch compute_61 |
NVIDIA GeForce RTX 2080 S | 4.87 ns | 802 | single | CUDA 9.2, -arch compute_61 |
NVIDIA GeForce RTX 2070 | 6.63 ns | 589 | double-single | CUDA 9.2, -arch compute_61 |
NVIDIA GeForce RTX 2070 | 6.28 ns | 621 | single | CUDA 9.2, -arch compute_61 |
NVIDIA A40 | 2.56 ns | 1528 | double-single | CUDA 11.5, -arch compute_80 |
NVIDIA A40 | 2.26 ns | 1728 | single | CUDA 11.5, -arch compute_80 |
NVIDIA A100 | 3.10 ns | 1260 | double-single | CUDA 11.5, -arch compute_80 |
NVIDIA A100 | 2.91 ns | 1343 | single | CUDA 11.5, -arch compute_80 |
Results were obtained from 1 independent measurement and are based on release version 1.0.0. Each run consisted of NVT equilibration at over (2×10⁴ steps), followed by benchmarking 10⁴ NVE steps 5 times in a row.
Variant “tiny”
This benchmark tests an alternative implementation of the pair force calculation that uses loop unrolling. It is particularly suited to systems with a small number of particles.
Parameters:
- 4,096 particles; all other parameters are as above
- neighbour lists are constructed directly, without binning to Verlet cells
- re-ordering of particle data in memory is disabled
- double-single floating point precision is enabled
Hardware | time per MD step and particle | steps per second | unroll force loop | compilation details |
---|---|---|---|---|
NVIDIA A40 | 17.9 ns | 13638 | true | CUDA 11.5, -arch compute_80 |
NVIDIA A40 | 30.5 ns | 8013 | false | CUDA 11.5, -arch compute_80 |
NVIDIA A100 | 22.6 ns | 10823 | true | CUDA 11.5, -arch compute_80 |
NVIDIA A100 | 38.8 ns | 6285 | false | CUDA 11.5, -arch compute_80 |
Results were obtained from 1 independent measurement and are based on the pre-release version 1.0.0-67-g24afb4c68. Each run consisted of NVT equilibration at over (2×10⁴ steps), followed by benchmarking 10⁴ NVE steps 5 times in a row.
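The gain from unrolling the force loop can be read off the table as the ratio of the step rates with and without unrolling. The short sketch below does this arithmetic for both GPUs; the dictionary layout and the names `tiny_results` and `unroll_speedup` are illustrative choices, and the numbers are copied from the table above. Since each entry stems from a single measurement, the ratios should be taken as indicative only.

```python
# Steps per second for the "tiny" variant, copied from the table above.
tiny_results = {
    "NVIDIA A40": {"unrolled": 13638, "not_unrolled": 8013},
    "NVIDIA A100": {"unrolled": 10823, "not_unrolled": 6285},
}


def unroll_speedup(result):
    """Ratio of the step rates with and without unrolling of the force loop."""
    return result["unrolled"] / result["not_unrolled"]


for gpu, result in tiny_results.items():
    print(f"{gpu}: speed-up from loop unrolling = {unroll_speedup(result):.2f}")
# NVIDIA A40: speed-up from loop unrolling = 1.70
# NVIDIA A100: speed-up from loop unrolling = 1.72
```

On both devices the unrolled variant thus runs roughly 1.7 times faster for this system size.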