Benchmarks

The benchmark results were produced by the scripts in examples/benchmarks, e.g.:

examples/benchmarks/generate_configuration.sh lennard_jones
examples/benchmarks/run_benchmark.sh lennard_jones

The Tesla GPUs had ECC enabled, no overclocking or other tweaking was done.

Simple Lennard-Jones fluid in 3 dimensions

Parameters:

  • 64,000 particles, number density \(\rho = 0.4\sigma^3\)

  • force: lennard_jones (\(r_c = 3\sigma, r_\text{skin} = 0.7\sigma\))

  • integrator: verlet (NVE, \(\delta t^* = 0.002\))

Hardware

time per MD step and particle

steps per second

FP precision

compilation details

Intel Xeon E5-2637v4

750 ns

20.9

double

GCC 8.3.0, -O3

NVIDIA H100

3.4 ns

4546

double-single

CUDA 11.2, -arch compute_80

3.3 ns

4666

single

CUDA 11.2, -arch compute_80

NVIDIA A40

3.5 ns

4436

double-single

CUDA 11.5, -arch compute_80

2.9 ns

5412

single

CUDA 11.5, -arch compute_80

NVIDIA A100

4.6 ns

3371

double-single

CUDA 11.5, -arch compute_80

3.9 ns

4028

single

CUDA 11.5, -arch compute_80

NVIDIA Tesla V100-PCI

5.6 ns

2790

double-single

CUDA 9.2, -arch compute_61

5.1 ns

3020

single

CUDA 9.2, -arch compute_61

NVIDIA GeForce RTX 2080 S

5.9 ns

2640

double-single

CUDA 9.2, -arch compute_61

5.6 ns

2810

single

CUDA 9.2, -arch compute_61

NVIDIA GeForce RTX 2070

7.7 ns

2030

double-single

CUDA 9.2, -arch compute_61

7.0 ns

2220

single

CUDA 9.2, -arch compute_61

Results were obtained from 1 independent measurement based on release version 1.0.0. Each run consisted of NVT equilibration at \(T^*=1.2\) over \(\Delta t^*=100\) (10⁴ steps), followed by benchmarking 10⁴ NVE steps 5 times steps in a row.

Supercooled binary mixture (Kob-Andersen)

Parameters:

  • 256,000 particles, number density \(\rho = 1.2\sigma^3\)

  • force: lennard_jones with 2 particle species (80% \(A\), 20% \(B\))

    (\(\epsilon_{AA}=1\), \(\epsilon_{AB}=1.5\), \(\epsilon_{BB}=.5\), \(\sigma_{AA}=1\), \(\sigma_{AB}=.8\), \(\sigma_{BB}=.88\), \(r_c = 2.5\sigma\), \(r_\text{skin} = 0.3\sigma\), neighbour list occupancy: 70%)

  • integrator: verlet (NVE, \(\delta t^* = 0.001\))

Hardware

time per MD step and particle

steps per second

FP precision

compilation details

Intel Xeon E5-2637v4

744 ns

5.25

double

GCC 10.2.1, -O3

NVIDIA H100

1.79 ns

2178

double-single

CUDA 11.2, -arch compute_80

1.74 ns

2238

single

CUDA 11.2, -arch compute_80

NVIDIA A40

2.56 ns

1528

double-single

CUDA 11.5, -arch compute_80

2.26 ns

1728

single

CUDA 11.5, -arch compute_80

NVIDIA A100

3.10 ns

1260

double-single

CUDA 11.5, -arch compute_80

2.91 ns

1343

single

CUDA 11.5, -arch compute_80

NVIDIA Tesla V100-PCI

3.83 ns

1020

double-single

CUDA 9.2, -arch compute_61

3.65 ns

1070

single

CUDA 9.2, -arch compute_61

NVIDIA GeForce RTX 2080 S

5.17 ns

755

double-single

CUDA 9.2, -arch compute_61

4.87 ns

802

single

CUDA 9.2, -arch compute_61

NVIDIA GeForce RTX 2070

6.63 ns

589

double-single

CUDA 9.2, -arch compute_61

6.28 ns

621

single

CUDA 9.2, -arch compute_61

Results were obtained from 1 independent measurement and are based on release version 1.0.0. Each run consisted of NVT equilibration at \(T^*=0.7\) over \(\Delta t^*=100\) (2×10⁴ steps), followed by benchmarking 10⁴ NVE steps 5 times in a row.

Variant “tiny”

This benchmark tests an alternative implementation of the calculation of pair forces, using loop-unrolling. It is particularly suited for systems with small particle number.

Parameters:

  • 4,096 particles, all other parameters are as above

  • neighbour lists are constructed directly, without binning to Verlet cells

  • re-ordering of particle data in memory is disabled

  • double-single floating point precision is enabled

Hardware

time per MD step and particle

steps per second

unroll force loop

compilation details

NVIDIA H100

23.6 ns

10319

yes

CUDA 11.2, -arch compute_80

43.7 ns

5578

no

CUDA 11.2, -arch compute_80

NVIDIA A40

17.9 ns

13638

yes

CUDA 11.5, -arch compute_80

30.5 ns

8013

no

CUDA 11.5, -arch compute_80

NVIDIA A100

22.6 ns

10823

yes

CUDA 11.5, -arch compute_80

38.8 ns

6285

no

CUDA 11.5, -arch compute_80

Results were obtained from 1 independent measurement and are based on the pre-release version 1.0.0-67-g24afb4c68. Each run consisted of NVT equilibration at \(T^*=0.7\) over \(\Delta t^*=100\) (2×10⁴ steps), followed by benchmarking 10⁴ NVE steps 5 times in a row.