While ASCII is a convenient way to dump arbitrary data in human-readable form without worrying much about portability, it has some serious drawbacks.
The HDF5 file format addresses these problems.
Furthermore, HDF5 is a well-established scientific file format, with C, C++, Fortran, and Python bindings available.
For a quick analysis of HDF5 data files, use the h5dump tool:
h5dump file | less -S
Individual groups or datasets may be displayed with:
h5dump -g /path/to/group file
h5dump -d /path/to/dataset file
To inspect only the structure of a file, omitting the data, use:
h5dump -A file
PyTables is a Python module wrapping the HDF5 library. It is based on NumPy, which implements a MATLAB-like interface to multi-dimensional arrays. This is where the HDF5 format reveals its true strength: NumPy allows arbitrary transformations of HDF5 datasets, all within a real programming language.
As a simple example, we open an HDF5 file and print a dataset:
import tables
f = tables.open_file("file")  # open_file replaces the deprecated openFile
d = f.root.path.to.dataset[:]  # read the full dataset into a NumPy array
print(d)
f.close()
Attributes may be read with the _v_attrs class member:
print(f.root.param.mdsim._v_attrs.dimension)
print(f.root.param.program._v_attrs.version)
For further information, refer to the NumPy and SciPy documentation and the PyTables User's Manual.
A last hint: try IPython, an interactive Python shell with auto-completion.
All data files contain an identical param group with all simulation parameters. This is the definitive place to gather parameter values in your evaluation scripts. Do not (ab)use the log file for evaluation purposes.
param
\-- correlation
| \-- block_count number of blocks
| \-- block_shift block shift of intermediate block levels
| \-- block_size number of samples per block
| \-- max_samples maximum number of acquired samples for lowest blocks
| \-- min_samples minimum number of acquired samples per block
| \-- q_error relative deviation of averaging wave vectors
| \-- q_values wave vector value(s) for correlation functions
| \-- sample_rate sample frequency at lowest block level
| \-- steps scheduled number of steps
| \-- time scheduled simulation time
|
\-- mdsim
| \-- backend simulation backend name
| \-- blocks number of CUDA blocks in grid
| \-- box_length simulation box edge length
| \-- cell_length cell edge length
| \-- cell_occupancy average ratio of particles to cell placeholders
| \-- cells number of cells per dimension
| \-- cutoff_radius potential cut-off radii AA,AB,BB in units of sigma
| \-- density number density
| \-- dimension positional coordinates dimension
| \-- effective_steps simulated number of steps
| \-- neighbour_skin neighbour list skin
| \-- neighbours number of placeholders per neighbour list
| \-- pair_separation hard-sphere pair separation
| \-- particles number of particles per species
| \-- placeholders total number of cell list placeholders
| \-- potential_epsilon potential well depths AA,AB,BB
| \-- potential_sigma collision diameters AA,AB,BB
| \-- potential_smoothing C²-potential smoothing factor
| \-- tcf_backend correlation functions backend (host or gpu)
| \-- thermostat_nu heat bath collision probability
| \-- thermostat_steps heat bath coupling frequency
| \-- thermostat_temp heat bath temperature
| \-- threads number of CUDA threads per block
| \-- timestep simulation time-step
|
\-- program
\-- name program name (HALMD)
\-- variant compile-time feature flags
\-- version git repository version
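Reading these parameters with PyTables follows the attribute access shown earlier. The following sketch (assuming PyTables >= 3.0; the file name and attribute values are illustrative, while the group and attribute names follow the tree above) writes a minimal file mimicking the param group layout and reads it back:

```python
import tables

# Write a minimal file with a param/mdsim group carrying two attributes.
with tables.open_file("params_demo.h5", mode="w") as f:
    param = f.create_group("/", "param")
    mdsim = f.create_group(param, "mdsim")
    mdsim._v_attrs.dimension = 3
    mdsim._v_attrs.particles = 1000

# Read the attributes back, as an evaluation script would.
with tables.open_file("params_demo.h5", mode="r") as f:
    attrs = f.root.param.mdsim._v_attrs
    print(attrs.dimension, attrs.particles)
```

In a real data file, only the reading half applies; the writing half merely makes the sketch self-contained.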
A particle trajectory file contains three datasets:
trajectory
\-- r periodically extended particle coordinates
\-- v particle velocities
\-- t trajectory times
The r and v datasets are three-dimensional double-precision floating-point arrays. The first dimension is the trajectory sample number, the second is the particle number, and the third is the coordinates dimension.
For the host backend, the particle coordinates reflect the internal state of the simulation. For the GPU backend, the coordinates are calculated from the periodic box traversal vector (an integer multiple of the box size) and the periodically reduced single-precision coordinates, which introduces rounding errors.
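The (sample, particle, coordinate) layout can be illustrated with a NumPy array of the same shape (dummy data; the shape values are illustrative):

```python
import numpy as np

# Dummy trajectory: 4 samples, 10 particles, 3 coordinates,
# mirroring the (sample, particle, coordinate) layout described above.
r = np.zeros((4, 10, 3))

first_sample = r[0]           # coordinates of all particles at the first sample
particle_traj = r[:, 0]       # trajectory of particle 0 over all samples
displacement = r[-1] - r[0]   # net displacement between last and first sample
print(first_sample.shape, particle_traj.shape, displacement.shape)
```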
A thermodynamic variables file contains one dataset per measured variable:
\-- EKIN mean kinetic energy per particle
\-- EPOT mean potential energy per particle
\-- ETOT mean total energy per particle
\-- PRESS virial pressure
\-- TEMP temperature
\-- VCM velocity center of mass
All datasets are two-dimensional. The first dimension describes the sample number. The second dimension contains the sample time and the variable value. All values are double-precision floating point numbers, but may be measured in single-precision internally depending on the backend.
With the GPU backend, the inner sum is truncated to single-precision.
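Extracting a time series from this two-column layout is a matter of slicing. A sketch with dummy values (dataset name and numbers are illustrative):

```python
import numpy as np

# Dummy dataset with the (sample, [time, value]) layout described above.
etot = np.array([[0.0, -4.51],
                 [0.1, -4.50],
                 [0.2, -4.52]])

times = etot[:, 0]   # sample times
values = etot[:, 1]  # variable values
print(values.mean())
```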
A time-correlation functions file contains one dataset per function:
\-- MSD mean squared displacement
\-- MQD mean quartic displacement
\-- VAC velocity auto-correlation function
\-- ISF coherent/intermediate scattering function
\-- SISF incoherent/self-intermediate scattering function
\-- SISF2 squared self-intermediate scattering function
\-- STRESS virial stress
Datasets are either three- or four-dimensional, of double-precision floating-point type.
For three-dimensional datasets, the first dimension is the block level, the second is the block size, and the third contains the correlation time, the mean, the standard error of the mean, the variance, and the count.
For four-dimensional datasets, the first dimension is the wave vector, the second is the block level, the third is the block size, and the fourth contains the wave number, the correlation time, the mean, the standard error of the mean, the variance, and the count.
The MSD, MQD, VAC and STRESS datasets are three-dimensional; the ISF, SISF and SISF2 datasets are four-dimensional.
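The three-dimensional layout can be sketched with a NumPy array of the same shape (dummy shape and values; only the column order follows the description above):

```python
import numpy as np

# Dummy 3D correlation dataset: 2 block levels, 3 samples per block,
# 5 columns (correlation time, mean, standard error, variance, count).
msd = np.zeros((2, 3, 5))
msd[0, :, 0] = [0.01, 0.02, 0.03]   # correlation times of block level 0
msd[0, :, 1] = [0.1, 0.2, 0.3]      # mean values of block level 0

level = 0
t = msd[level, :, 0]     # correlation times
mean = msd[level, :, 1]  # means
err = msd[level, :, 2]   # standard errors, e.g. for error bars
print(t, mean, err)
```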
A profile file contains a dataset for each CPU or GPU performance counter.
times
\-- boltzmann Boltzmann distribution
\-- event_queue event queue processing
\-- hilbert_sort Hilbert curve sort
\-- init_cells cell lists initialisation
\-- lattice lattice generation
\-- maximum_displacement maximum particle displacement reduction
\-- maximum_velocity maximum velocity reduction
\-- mdstep MD simulation step
\-- memcpy_cells cell lists memcpy
\-- permutation phase space sample sort
\-- potential_energy potential energy sum reduction
\-- random_config random initial particle configuration
\-- reduce_contacts mean number of contacts reduction
\-- reduce_squared_velocity mean squared velocity reduction
\-- reduce_velocity velocity center of mass reduction
\-- sample phase space sampling
\-- sample_memcpy sample memcpy
\-- update_cells cell lists update
\-- update_forces Lennard-Jones force update
\-- update_neighbours neighbour lists update
\-- velocity_verlet velocity-Verlet integration
\-- virial_sum virial equation sum reduction
Each dataset contains the average execution time of a GPU or CPU function in seconds, the standard deviation in seconds and the number of measurements.
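From these three numbers, the total time spent in a function can be estimated as average time multiplied by the number of measurements. A sketch with dummy values (the counter name and numbers are illustrative):

```python
import numpy as np

# Dummy performance counter: (average time in s, std deviation in s, count).
mdstep = np.array([1.5e-3, 2.0e-4, 10000.0])

mean_time, std_dev, count = mdstep
total = mean_time * count   # estimated total time spent in this function
print(total)
```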