Data files

Why HDF5?

While ASCII is a convenient way to effortlessly dump arbitrary data in human-readable form without worrying much about portability, it has some serious drawbacks:

  • An ASCII representation requires about twice as much storage for floating-point data as the binary equivalent. Large datasets such as particle trajectories thus require compression for reasonable file sizes.
  • The conversion between binary and ASCII is lossy: printing a binary floating-point number with a limited number of decimal digits and parsing it back does not, in general, recover the original value (see the snippet after this list). From the physics point of view, a loss of accuracy in the least significant digit is bearable; for exact reproducibility of a simulation, for example when inspecting software bugs, it is not.
  • Most importantly, ASCII does not enforce any data structure.
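
As a quick illustration of the round-trip problem, a few lines of plain Python show that writing a floating-point number with a limited number of decimal digits and reading it back does not recover the original binary value:

x = 1.0 / 3.0
y = float("%.6f" % x)  # write with 6 decimal digits, read back
assert y != x          # the round-trip is lossy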

The HDF5 file format presents a solution to these problems:

  • Data is stored in a binary, portable format. HDF5 transparently handles the native C data types: 8/16/32/64-bit integers and single- or double-precision floating-point numbers, in big- or little-endian representation.
  • A file is structured hierarchically: every file contains a root group, which may itself contain named datasets or further named groups. Both datasets and groups may be described with attributes of arbitrary type.

Furthermore, HDF5 is a well-established scientific file format, with C, C++, Fortran and Python bindings available.
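
To make the hierarchical structure concrete, here is a minimal PyTables sketch; the file, group, dataset and attribute names are invented for illustration:

import numpy
import tables

f = tables.open_file("example.h5", mode="w")        # the root group is created implicitly
g = f.create_group(f.root, "trajectory")            # a named group below the root group
f.create_array(g, "r", numpy.zeros((10, 1000, 3)))  # a named dataset inside the group
g._v_attrs.dimension = 3                            # an attribute attached to the group
f.close()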

Working with HDF5 files

Using the h5dump tool

For a quick analysis of HDF5 data files, use the h5dump tool:

h5dump file | less -S

Individual groups or datasets may be displayed with:

h5dump -g /path/to/group file
h5dump -d /path/to/dataset file

To inspect only the structure of a file, omitting the data, use:

h5dump -A file

Using Python and PyTables

PyTables is a Python module wrapping the HDF5 library. It is based on NumPy, which provides a MATLAB-like interface to multi-dimensional arrays. This is where the HDF5 format reveals its true strength: NumPy allows arbitrary transformations of HDF5 datasets, all within a full programming language.

As a simple example, we open an HDF5 file and print a dataset:

import tables

f = tables.open_file("file")   # open_file() supersedes the deprecated openFile()
d = f.root.path.to.dataset[:]  # slicing reads the dataset into a NumPy array
print(d)
f.close()

Attributes may be read with the _v_attrs member of a group:

print(f.root.param.mdsim._v_attrs.dimension)
print(f.root.param.program._v_attrs.version)

For further information, refer to the NumPy and SciPy documentation and the PyTables User’s Manual.

A last hint: try IPython, an interactive Python shell with auto-completion.

File formats

Simulation parameters

All data files contain an identical param group with all simulation parameters. This is the definitive place to gather parameter values in your evaluation scripts. Do not (ab)use the log file for evaluation purposes.

param
 \-- correlation
 |    \-- block_count         number of blocks
 |    \-- block_shift         block shift of intermediate block levels
 |    \-- block_size          number of samples per block
 |    \-- max_samples         maximum number of acquired samples for lowest blocks
 |    \-- min_samples         minimum number of acquired samples per block
 |    \-- q_error             relative deviation of averaging wave vectors
 |    \-- q_values            wave vector value(s) for correlation functions
 |    \-- sample_rate         sample frequency at lowest block level
 |    \-- steps               scheduled number of steps
 |    \-- time                scheduled simulation time
 |
 \-- mdsim
 |    \-- backend             simulation backend name
 |    \-- blocks              number of CUDA blocks in grid
 |    \-- box_length          simulation box edge length
 |    \-- cell_length         cell edge length
 |    \-- cell_occupancy      average ratio of particles to cell placeholders
 |    \-- cells               number of cells per dimension
 |    \-- cutoff_radius       potential cut-off radii AA,AB,BB in units of sigma
 |    \-- density             number density
 |    \-- dimension           positional coordinates dimension
 |    \-- effective_steps     simulated number of steps
 |    \-- neighbour_skin      neighbour list skin
 |    \-- neighbours          number of placeholders per neighbour list
 |    \-- pair_separation     hard-sphere pair separation
 |    \-- particles           number of particles per species
 |    \-- placeholders        total number of cell list placeholders
 |    \-- potential_epsilon   potential well depths AA,AB,BB
 |    \-- potential_sigma     collision diameters AA,AB,BB
 |    \-- potential_smoothing C²-potential smoothing factor
 |    \-- tcf_backend         correlation functions backend (host or gpu)
 |    \-- thermostat_nu       heat bath collision probability
 |    \-- thermostat_steps    heat bath coupling frequency
 |    \-- thermostat_temp     heat bath temperature
 |    \-- threads             number of CUDA threads per block
 |    \-- timestep            simulation time-step
 |
 \-- program
      \-- name                program name (HALMD)
      \-- variant             compile-time feature flags
      \-- version             git repository version
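
For example, an evaluation script may collect parameters straight from the param group; a sketch, assuming the parameters are stored as group attributes as in the _v_attrs examples above:

import tables

f = tables.open_file("file")
box_length = f.root.param.mdsim._v_attrs.box_length  # simulation box edge length
particles = f.root.param.mdsim._v_attrs.particles    # number of particles per species
timestep = f.root.param.mdsim._v_attrs.timestep      # simulation time-step
f.close()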

Trajectories (TRJ)

A particle trajectory file contains three datasets:

trajectory
 \-- r                        periodically extended particle coordinates
 \-- v                        particle velocities
 \-- t                        trajectory times

periodically extended particle coordinates

A three-dimensional double-precision floating-point dataset. The first dimension is the trajectory sample number. The second dimension is the particle number. The third dimension is the coordinates dimension.

For the host backend, the particle coordinates reflect the internal state of the simulation. For the GPU backend, the coordinates are calculated from the periodic box traversal vector (an integer multiple of the box size) and the periodically reduced single-precision coordinates, which introduces rounding errors.
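
Schematically, with L^* the box edge length, \vec{n}_i(t) the integer image vector counting box traversals, and \hat{\vec{r}}_i(t) the periodically reduced single-precision coordinates (this notation is ours, for illustration):

\vec{r}^*_i(t) = L^* \vec{n}_i(t) + \hat{\vec{r}}_i(t)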

particle velocities

A three-dimensional double- or single-precision floating-point dataset. The first dimension is the trajectory sample number. The second dimension is the particle number. The third dimension is the coordinates dimension.
trajectory times

A one-dimensional double-precision floating-point dataset. The first dimension is the trajectory sample number.
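
A short sketch of how this layout translates into NumPy indexing, with the dataset paths taken from the tree above:

import tables

f = tables.open_file("file")
r = f.root.trajectory.r[:]  # shape: (samples, particles, dimension)
t = f.root.trajectory.t[:]  # shape: (samples,)
r0 = r[0]                   # coordinates of all particles at the first sample
f.close()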

Thermodynamic variables (TEP)

A thermodynamic variables file contains one dataset per measured variable:

\-- EKIN                      mean kinetic energy per particle
\-- EPOT                      mean potential energy per particle
\-- ETOT                      mean total energy per particle
\-- PRESS                     virial pressure
\-- TEMP                      temperature
\-- VCM                       velocity center of mass

All datasets are two-dimensional. The first dimension is the sample number. The second dimension contains the sample time and the variable value. All values are double-precision floating-point numbers, but may be measured in single precision internally, depending on the backend.
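
For instance, the temperature course may be extracted as follows; a sketch, assuming the datasets reside in the root group as the listing above suggests:

import tables

f = tables.open_file("file")
temp = f.root.TEMP[:]          # shape: (samples, 2)
t, T = temp[:, 0], temp[:, 1]  # sample time and variable value
f.close()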

mean kinetic energy per particle

\langle E^*_{kin}\rangle =
\frac{1}{N} \sum_{i=1}^N \frac{(\vec{v}^*_i)^2}{2}
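
The formula translates directly into NumPy; a sketch, where the random array merely stands in for one velocity sample of the trajectory file:

import numpy

v = numpy.random.rand(1000, 3)             # stand-in: one velocity sample (particles, dimension)
ekin = 0.5 * (v ** 2).sum(axis=-1).mean()  # (1/N) sum_i (v*_i)^2 / 2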

mean potential energy per particle

\langle U^*\rangle = \frac{1}{N}
\sum_{i=1}^N \bigl(\sum_{j>i}^N U(\vec{r}^*_{ij})\bigr)

With the GPU backend, the inner sum is truncated to single precision.

mean total energy per particle

\langle E^*\rangle = \langle U^*\rangle + \langle E^*_{kin}\rangle

virial pressure

P^* = \frac{1}{d V^*} \sum_{i=1}^N
\bigl( (\vec{v}^*_i)^2 +
\sum_{j>i}^N \vec{F}(\vec{r}^*_{ij})\cdot\vec{r}^*_{ij}\bigr)

With the GPU backend, the inner sum is truncated to single precision.

temperature

T^* = \frac{1}{d N} \sum_{i=1}^N (\vec{v}^*_i)^2

velocity center of mass

\langle \vec{v}^*\rangle = \frac{1}{N} \sum_{i=1}^N \vec{v}^*_i

Time-correlation functions (TCF)

A time-correlation functions file contains one dataset per function:

\-- MSD                       mean squared displacement
\-- MQD                       mean quartic displacement
\-- VAC                       velocity auto-correlation function
\-- ISF                       coherent/intermediate scattering function
\-- SISF                      incoherent/self-intermediate scattering function
\-- SISF2                     squared self-intermediate scattering function
\-- STRESS                    virial stress

Datasets are three- or four-dimensional, of double-precision floating-point type.

For three-dimensional datasets, the first dimension is the block level. The second dimension is the block size. The third dimension contains the correlation time, the mean, the standard error of the mean, the variance and the count.

For four-dimensional datasets, the first dimension is the wave vector. The second dimension is the block level. The third dimension is the block size. The fourth dimension contains the wave number, the correlation time, the mean, the standard error of the mean, the variance and the count.
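
As an example, the mean squared displacement at the lowest block level may be extracted like this; a sketch based on the layout just described:

import tables

f = tables.open_file("file")
msd = f.root.MSD[0]  # lowest block level, shape: (block size, 5)
t = msd[:, 0]        # correlation time
mean = msd[:, 1]     # mean
error = msd[:, 2]    # standard error of the mean
f.close()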

mean squared displacement

A three-dimensional dataset.

\delta r(t)^2 = \left\langle\lvert \vec{r_i}(t)-\vec{r_i}(0)\rvert^2\right\rangle

mean quartic displacement

A three-dimensional dataset.

\delta r(t)^4 = \left\langle\lvert \vec{r_i}(t)-\vec{r_i}(0)\rvert^4\right\rangle

velocity auto-correlation function

A three-dimensional dataset.

\mathcal{Z}(t) = \Big\langle \vec{v}(t)\cdot\vec{v}(0)\Big\rangle

coherent/intermediate scattering function

A four-dimensional dataset.

F(\vec{q},t) = \frac{1}{N}\Bigl\langle
\Bigl(\sum_{i=1}^N e^{-i\vec{q}\cdot\vec{r_i}(t)}\Bigr)
\Bigl(\sum_{i=1}^N e^{i\vec{q}\cdot\vec{r_i}(0)}\Bigr)\Bigr\rangle

incoherent/self-intermediate scattering function

A four-dimensional dataset.

F_s(\vec{q},t) = \frac{1}{N} \Bigl\langle\sum_{i=1}^N
e^{-i\vec{q}\cdot\bigl(\vec{r_i}(t)-\vec{r_i}(0)\bigr)}\Bigr\rangle
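
Numerically, the self part follows from two coordinate samples; a sketch, where q is a single wave vector and the random arrays stand in for (particles, dimension) coordinate samples:

import numpy

r_0 = numpy.random.rand(1000, 3)           # stand-in: coordinates at time 0
r_t = r_0 + numpy.random.rand(1000, 3)     # stand-in: coordinates at time t
q = numpy.array([2 * numpy.pi, 0.0, 0.0])  # one wave vector

# F_s(q, t) = (1/N) sum_i exp(-i q . (r_i(t) - r_i(0)))
f_s = numpy.exp(-1j * (r_t - r_0).dot(q)).mean().real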

squared self-intermediate scattering function

A four-dimensional dataset.

F_s(\vec{q}, t)^2 = \frac{1}{N} \Bigl\langle\Bigl(\sum_{i=1}^N
e^{-i\vec{q}\cdot\bigl(\vec{r_i}(t)-\vec{r_i}(0)\bigr)}\Bigr)^2
\Bigr\rangle

virial stress

A three-dimensional dataset.

J(t) = \Bigl\langle \Bigl(\sum_{i=1}^N
\bigl(v^*_{k,i}(t)\, v^*_{l,i}(t) +
\sum_{j>i}^N F^*_{k,ij}(t)\, r^*_{l,ij}(t)\bigr)
\Bigr)\Bigl( \sum_{i=1}^N
\bigl(v^*_{k,i}(0)\, v^*_{l,i}(0) +
\sum_{j>i}^N F^*_{k,ij}(0)\, r^*_{l,ij}(0)\bigr)
\Bigr) \Bigr\rangle_{k\le l}

Profiling (PRF)

A profile file contains a dataset for each CPU or GPU performance counter.

times
 \-- boltzmann                Boltzmann distribution
 \-- event_queue              event queue processing
 \-- hilbert_sort             Hilbert curve sort
 \-- init_cells               cell lists initialisation
 \-- lattice                  lattice generation
 \-- maximum_displacement     maximum particle displacement reduction
 \-- maximum_velocity         maximum velocity reduction
 \-- mdstep                   MD simulation step
 \-- memcpy_cells             cell lists memcpy
 \-- permutation              phase space sample sort
 \-- potential_energy         potential energy sum reduction
 \-- random_config            random initial particle configuration
 \-- reduce_contacts          mean number of contacts reduction
 \-- reduce_squared_velocity  mean squared velocity reduction
 \-- reduce_velocity          velocity center of mass reduction
 \-- sample                   phase space sampling
 \-- sample_memcpy            sample memcpy
 \-- update_cells             cell lists update
 \-- update_forces            Lennard-Jones force update
 \-- update_neighbours        neighbour lists update
 \-- velocity_verlet          velocity-Verlet integration
 \-- virial_sum               virial equation sum reduction

Each dataset contains the average execution time of a GPU or CPU function in seconds, the standard deviation in seconds and the number of measurements.
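
For a quick overview, one may loop over all performance counters; a sketch, assuming each dataset holds the three values in the order just described:

import tables

f = tables.open_file("file")
for node in f.iter_nodes(f.root.times):
    mean, std, count = node[:]
    print("%-25s %g s +/- %g s (%d calls)" % (node.name, mean, std, count))
f.close()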