John Storrs <john.storrs@xxxxxxxxxxxx> writes:

> We are evaluating HDF5 and NetCDF4 as archive file formats for fusion research
> data. We would like to use the same format for experimental (shot-based) data
> and modelling code data, to get the benefits of standardisation (one API to
> learn, one interface module to write for visualization tool access, etc). A
> number of fusion modelling codes use NetCDF. NetCDF for experimental data
> will be new though, so far as I know. I've found some problems in shot data
> archiving tests which need to be resolved for it to be considered further.
>
> MAST (Mega-Amp Spherical Tokamak) shot data (from magnetic sensors etc) is
> mostly digitized in the range 1 kHz to 2 MHz. MAST shots are currently less
> than 1 second in duration, but 5 second shots are foreseen (some other
> experiments have much longer shot times). We use up to 96-channel digitizers.
> Acquisition start time and sample period are common to a digitizer, but the
> number of samples per channel sometimes varies - that is, some channels may
> be sampled for a longer time than others. Channel naming is hierarchical.

I wonder if you could send the CDL of the test files you've come up with (i.e. run ncdump -h on the files). This would make your proposed data structures clearer. Also, is this code in C, Fortran, C++, or Java? Or something else?

> There are two NetCDF-related issues here. The first is how to store the
> channel data, the second how to store time, both efficiently of course. We
> want per-variable compression. We don't want uninitialised value padding in
> variable data, even if it would be efficiently compressed. In the normal case
> where acquisition start time and sample period are common to all channels in a
> dataset, we would prefer to define just one dimension, not many if channel
> data array sizes vary.

Do you then store each channel in a different variable? Would it make sense to use two dimensions, time and channel?
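For example, if all channels did share one time base, the CDL might look something like this (just a sketch; all the names and sizes here are made up):

```
netcdf shot {
dimensions:
        channel = 96 ;
        time = 4194304 ;
variables:
        double time(time) ;
                time:units = "seconds since shot start" ;
        float data(channel, time) ;
                data:long_name = "digitizer samples" ;
}
```

Per-variable compression could then be turned on for the data variable with nc_def_var_deflate.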
Can I ask why you don't want the uninitialized values stored in the file, even if they are compressed away? (Which will not happen unless you set fill mode, by the way, since unfilled values will contain random bits that will not compress well.)

An alternative would be to use the newly introduced variable-length (VLEN) arrays in netCDF-4. In that case you don't store any padding values. But using VLENs means that existing netCDF code will not work on the resulting data files, since VLEN was just introduced and existing code will not know how to deal with it. For your code this might not matter much, because you are writing it from scratch. But it also means that existing visualization packages will not cope with the data.

> NetCDF4 tests with a single fixed dimension, writing varying amounts of data
> to uncompressed channel variables, show that the variables are written to
> the archive file with padding, even in no_fill mode. The file size is
> independent of the amount of data written.

No_fill mode doesn't mean that the file size will change, just that the program will not take the time to initialize all the data to the fill value. Are you saying that it *is* initializing the data to a fill value? (That would be a bug.) Or just that the file size indicates that the data values are there (but filled with junk)? For example, a 10x10 array will have 100 values whether fill mode is off or on. If it is on, they will all be initialized to the fill value; if it is off, they will just contain garbage.

> NetCDF4 tests with a single unlimited dimension work for very small dimension
> sizes, but take forever to write even a single 4 MSample channel variable (we
> are using HDF5 1.8.2 if that's relevant to this problem). That looks the
> right way to go if the processing time and memory overhead is small, but we
> can't test it.

Probably this is a chunksize problem. NetCDF-4 selects a default set of chunk sizes when you create a variable.
Change them with the nc_def_var_chunking function (after defining the variable, but before the next enddef). For very large variables, the default chunk sizes don't work well in the 4.0 release. (This is fixed for the upcoming 4.0.1 release, so you can get the daily snapshot and see if this problem is better. Get the daily snapshot here:

ftp://ftp.unidata.ucar.edu/pub/netcdf/snapshot/netcdf-4-daily.tar.gz)

Try setting the chunk sizes to something reasonable by inserting an nc_def_var_chunking call. For example, try setting them to the size of the array of data that you are writing in one call to nc_put_vara_* (or some integer multiple of that). Chunking is complex, but only important if you are I/O bound. (As you apparently will be, with your 4 MSample case, using the 4.0 default chunk sizes.)

> Coming to storage of the time coordinate variable. If we actually store the
> data, it will need to be in a double array to avoid loss of precision.
> Alternatively we could define the variable as an integer with a double scale
> and offset. Both of these sound inefficient. Traditionally we store this
> type of data as a sequence of triples: start time, time increment, count.
> Clearly we can do that within a convention, expanding it in reader code.
> How should we handle this?

Would the CF conventions time coordinate work for you? This is a start time stored as an attribute, and then the time of each observation stored in the coordinate variable. For example:

    double time(time) ;
        time:long_name = "time" ;
        time:units = "days since 1990-1-1 0:0:0" ;

(Of course, you would want seconds, not days.) For more, see:

http://cf-pcmdi.llnl.gov/documents/cf-conventions/1.4/cf-conventions.html#time-coordinate

Thanks,

Ed

--
Ed Hartnett -- ed@xxxxxxxxxxxxxxxx