We are evaluating HDF5 and NetCDF4 as archive file formats for fusion research data. We would like to use the same format for experimental (shot-based) data and modelling code data, to get the benefits of standardisation (one API to learn, one interface module to write for visualization tool access, etc). A number of fusion modelling codes use NetCDF. NetCDF for experimental data will be new though, so far as I know. I've found some problems in shot data archiving tests which need to be resolved for it to be considered further.

MAST (Mega-Amp Spherical Tokamak) shot data (from magnetic sensors etc) is mostly digitized in the range 1 kHz to 2 MHz. MAST shots are currently less than 1 second in duration, but 5 second shots are foreseen (some other experiments have much longer shot times). We use up to 96-channel digitizers. Acquisition start time and sample period are common to a digitizer, but the number of samples per channel sometimes varies - that is, some channels may be sampled for a longer time than others. Channel naming is hierarchical.

There are two NetCDF-related issues here. The first is how to store the channel data, the second how to store time - both efficiently, of course. We want per-variable compression. We don't want uninitialised value padding in variable data, even if it would be efficiently compressed. In the normal case where acquisition start time and sample period are common to all channels in a dataset, we would prefer to define just one dimension, not many, even if channel data array sizes vary.

NetCDF4 tests with a single fixed dimension, writing varying amounts of data to uncompressed channel variables, show that the variables are written to the archive file with padding, even in no_fill mode. The file size is independent of the amount of data written.
NetCDF4 tests with a single unlimited dimension work for very small dimension sizes, but take forever to write even a single 4 MSample channel variable (we are using HDF5 1.8.2, if that's relevant to this problem). That looks like the right way to go if the processing time and memory overhead is small, but we can't test it.

Coming to storage of the time coordinate variable: if we actually store the data, it will need to be in a double array to avoid loss of precision. Alternatively, we could define the variable as an integer with a double scale and offset. Both of these sound inefficient. Traditionally we store this type of data as a (sequence of) triples: start time, time increment, count. Clearly we can do that within a convention, expanding it in reader code. How should we handle this?

Your comments would be appreciated.

Regards,
John Storrs

--
John Storrs, Experiments Dept      e-mail: john.storrs@xxxxxxxxxxxx
Building D3, UKAEA Fusion          tel: 01235 466338
Culham Science Centre              fax: 01235 466379
Abingdon, Oxfordshire OX14 3DB     http://www.fusion.org.uk
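[As a postscript, the (start, increment, count) triple convention proposed above is cheap to expand in reader code. A minimal sketch, with hypothetical function and variable names, using a sequence of triples so that gaps or sample-rate changes can also be represented:]

```python
# Expand a sequence of (start, increment, count) time triples into one
# double-precision time array, as a reader for the proposed convention
# might do. Purely illustrative.
import numpy as np

def expand_time(triples):
    """triples: iterable of (start, increment, count) tuples.
    Returns a single float64 array of sample times."""
    segments = [start + increment * np.arange(count, dtype=np.float64)
                for start, increment, count in triples]
    return np.concatenate(segments)

# e.g. 1 MHz sampling starting at t = -0.05 s, 4 MSamples:
t = expand_time([(-0.05, 1.0e-6, 4_000_000)])
```

Storing three numbers per acquisition segment instead of one double per sample reduces the time coordinate to a negligible fraction of the channel data, at the cost of a small convention that readers must know about.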