netCDF and "complex" data

There has been much discussion (mostly last month) about storing complex
structures in netCDF.  I have been meaning to respond, but I have been very
busy.  A lot of the e-mail traffic duplicates a few things that we talked
about at NSSDC a few years ago for CDF, but had not implemented at that time.
However, here at the Visualization Systems Group at IBM T. J. Watson Research
Center, the discussion also duplicates much of what was considered
and then implemented almost two years ago as part of the development of the
IBM Visualization Data Explorer software.  Therefore, I have attached a brief
document that describes some of what we have done in the context of importing
data stored in netCDF for your consideration.  I believe it addresses many of
the issues that have been discussed in this forum.  Keep in mind this is NOT a
proposal, but an outline of an actual implementation that is available in
a commercial software product today.  Although the ideas can be cast in a form
independent of Data Explorer, that software does fully support them.  If you
have any questions, and especially if you have any comments (positive or
negative), please let me know.

Lloyd Treinish

-------------------------------------------------------------------------------

Importing netCDF data into IBM Visualization Data Explorer

Lloyd A. Treinish
Visualization Systems Group
IBM T. J. Watson Research Center
P. O. Box 704
Yorktown Heights, NY 10598
lloydt@xxxxxxxxxxxxxx

The IBM Visualization Data Explorer (DX) is a general-purpose software package
for scientific data visualization.  It employs a data-flow-driven client-
server execution model and is currently available on five platforms: IBM POWER
Visualization Systems (a medium-grain, shared memory parallel supercomputer)
and workstations -- IBM RISC System/6000, Silicon Graphics Indigo and Crimson,
Hewlett-Packard 700 and Sun Sparcstation 2.  DX is built on a foundation of an
internal data model, which describes and provides uniform access services for
any data brought into, generated by, or exported from the software.  Hence, it
supports a number of different classes of scientific data, which can be
described by their shape (size and number of dimensions), rank (e.g., scalar,
vector, tensor), type (float, integer, byte, etc. or real, complex,
quaternion), where the data are located in space (positions), how the
locations are related to each other (connections), and how they are
aggregated or grouped (e.g., hierarchies, series, composites, etc.).  It also
supports those entities required for graphics and imaging operations within
the context of Data Explorer.  Generically, these are called "objects".  The
DX data model is supported with an applications programming interface (API)
for users or developers to create functions or operations (i.e., modules) for
DX. At the user-level (i.e., via a graphical user interface, visual
programming or scripting-language programming) the details of the data model
and this interface are hidden.  An important consequence of this approach is
that modules are polymorphic.  In addition, there is an external
representation, the native dx format.  It is a multiple sequential file
representation of DX objects.  The DX data model is quite rich.  Most of what
it can support is not directly expressible in netCDF.  Therefore, a
methodology to extend netCDF for use with Data Explorer was developed.  For
more information about the DX architecture and data model see, for example,
R. Haber et al., "A Data Model for Scientific Visualization with Provisions
for Regular and Irregular Grids", Proceedings IEEE Visualization '91
Conference, pp. 298-305, October 1991; B. Lucas et al., "An Architecture for
a Scientific Visualization System", Proceedings IEEE Visualization '92,
pp. 107-113, October 1992; and "IBM Visualization Data Explorer User's Guide,
Second Edition", IBM Document Number SC38-0496-1, August 1992.

NetCDF is a data abstraction for (self-describing) multi-dimensional blocks.
The descriptions are in terms of attributes, which may be assigned globally or
to one or more variables (i.e., a multi-dimensional block).  NetCDF in the DX
context provides a portable and commonly-used API (C and a veneer layer for
FORTRAN 77), and a fixed, portable physical file structure (a single XDR file)
in the public domain.  NetCDF only knows about arrays of scalars and is a
carrier for them and their descriptions.  There is NO knowledge or semantics
embedded with regard to any other structure.  Since such arrays are inherently
flat and rectilinear, there is typically insufficient information to define
suitable objects for import to DX, especially for irregular or hierarchical
data.  A netCDF user is free to define custom conventions for the array
storage and attribute nomenclature.  In this sense it is possible to create a
mechanism to support a limited set of other structures on top of the array
"protocol".  However, this also means that a generic netCDF reader would only
be able to report contents and be unable to operate on any underlying context.
Such a context for the creation of DX objects for their importation has been
defined.  Any system that attempts to support structures more complex than
what raw netCDF handles would have to deal with this situation.  The notion of
being able to import any random netCDF and create the correct DX object is NOT
possible given the limited netCDF vocabulary.  The exception would be for a
very limited class of regular/rectilinear arrays, in which any more complex
structure is ignored (e.g., a simple image).  The DX convention for simple
regular data is essentially based on that idea.  Hence, a visualization
package that is only capable of dealing with "native" netCDF data would
necessarily have limited functionality.  DX is capable of dealing with a far
greater variety of complex data, only a subset of which can be expressed
effectively in netCDF even when one does so via external constraints.

What are these aforementioned conventions?  The netCDF vocabulary is not
sufficiently rich nor at a high enough level to adequately describe the kinds
of objects that must be supported for general visualization and analysis.
This is a result of the heritage of the original CDF implementation at
NASA/GSFC in the mid-1980s.  Although the current CDF implementation at NASA
does address a few of these limitations, both netCDF and CDF still are focused
primarily on a relatively low-level abstraction -- multidimensional blocks.
DX objects can be decomposed to a lower level, that of multidimensional
arrays.  However, the DX array objects are more flexible than those of the
CDF/netCDF model because they support rank and shape/dimensionality
independently.  Nevertheless, netCDF can be used as a carrier of
self-describing multidimensional arrays, whose descriptions, when following
a certain convention, can be used by DX to create proper objects.  Of
course, this may
not always be practical since there are significant limitations on the kinds
of arrays that a single netCDF may contain based upon constraints such as
size, number of named dimensions, etc. due to what the netCDF software
supports and its physical file structure.  This is an additional justification
among other reasons for requiring a native structure.  The best way to
illustrate these ideas is with a few examples.

Scalar data that is on a regular grid can be imported into Data Explorer from
a "standard" netCDF file.  To import vector data, data on irregular grids, or
time series data, additional attributes must be added to the netCDF file.
These attributes allow you to specify the data, positions, and connections
components of your data set.

REGULAR GRIDS

To import scalar data on a regular grid, specify the netCDF file name as the
"name" parameter in the Import module.  By default, all netCDF variables will
be imported and collected into a group.  To import one or more particular
variables, specify their names as the "variable" parameter.  The "format"
parameter must be "netCDF".

Data Explorer automatically constructs positions and connections for each
variable, with an origin of 0.0 and spacings of 1.0 along each dimension.

For data that is logically a vector field, but whose values are stored in
three separate netCDF variables, each component of the vector can be imported
separately; the Compute module can then be used to create a single vector
field.  For data that is logically a vector field, but whose values are stored
as an n+1 dimensional regular grid, use the Slice and Compute modules to
separate the components of the vector, and then recombine them into a single
vector field.

Example of a Simple Regular Grid

The following netCDL describes a 3 x 3 x 3 regular grid at origin (0, 0, 0)
with deltas of 1.0 along each axis.

netcdf volume {

dimensions:
       nx = 3;
       ny = 3;
       nz = 3;

variables:
       float field_data(nx, ny, nz);

data:
       field_data =
           0, 0, 0,           0, 0, 0,           0, 5, 0,
           0, 0, 5,           0, 0, 0,           0, 0, 0,
           5, 0, 0,           0, 0, 0,           0, 0, 0;
}

NetCDF data on completely regular grids can be imported directly by Data
Explorer without modifying the netCDF file, as indicated earlier.

COMPLEX FIELDS

For data with more complex structure, conventions have been established for
netCDF variable attributes, as described below.  There are two key variable
attributes that you will need to define for each netCDF variable: "field",
which names the field and specifies the rank of the data, and "positions",
which specifies where the locations of the data in space are stored.  The
default for connections (i.e., the topological primitive) is quads, cubes,
etc., depending on the shape of the field.  If you do not specify positions,
regularity is assumed with an origin at 0.0 and a spacing of 1.0.  Data
Explorer does support
dimensional or array products.  This is a generalization of the notion of
product specification for rectilinear grids that is employed in CDF and
netCDF.  Hence, this idea is exploited in the netCDF conventions.

It should be noted that netCDF does not make a distinction about the
relationship between data dependency and mesh structure -- it is just arrays.
Such a distinction is at an applications level above netCDF.  Data Explorer
allows you to specify whether the values associated with a grid or mesh are to
be assigned at the node points of the mesh or the center of the grid cells.
For data in netCDF to be imported into DX, it is assumed that the data are
associated with node points (i.e., data are dependent on positions).  If this
is not appropriate for the data of interest, the Post module can be used to
convert to a cell-centered form (i.e., data are dependent on connections)
after importing.  Alternatively, the additional field components described below
can be used.

IRREGULAR ARRAYS

Data

To indicate that a netCDF variable contains values corresponding to the data
component, it must have the following attribute:

     variable1:field = "fieldname";

variable1 is the name of the netCDF variable containing data values to be
imported, and fieldname is the name of the Data Explorer field by which the
user refers to the data (for example, "temperature", "pressure", "wind").
If more
than one variable is tagged with the same field name, each variable is read
into a field, and the fields are collected into a group.

The data are read in as an array of values, one number per grid point.  If the
data are actually a vector or a matrix at each grid point, use one of the
following modifiers:

variable1:field = "fieldname, vector";

variable1:field = "fieldname, matrix";

The nonscalar data are stored in additional dimensions for the variable.  For
a static three-dimensional 3-vector, the three components are stored in a
fourth dimension of size 3.

If the data have both regular connections and regular positions, no other
attributes are required.  A regular grid is assumed, with the origin at 0.0,
and a spacing of 1.0 along each axis.  The number of axes will be determined
from the number of dimensions in the data array.

Positions

If the locations of the data values in variable1 do not form a regular lattice
(with origins at 0.0 and spacings of 1.0), the name of a netCDF variable that
contains the position information must be specified as an attribute for
variable1.

There are five different types of position specifications: none, completely
regular, completely irregular, and two types of partially regular.

Completely irregular is assumed if the following attribute is specified:

variable1:positions = "variable2";

where variable2 is an array of vectors, one for each grid point, defining its
location.  The dimensionality of the data space is determined by the number of
items in a vector.

Regular positions can be specified with just the origin and spacing between
grid points along each axis in compact form.  The following attribute is used:

variable1:positions = "variable2, regular";

where variable2 is the name of an n x 2 array containing origin, delta
pairs for the spacing and location of positions along each axis.  The number
of positions along each axis is determined from the shape of variable1.
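
As a rough illustration of what the origin, delta form encodes, the following
Python sketch expands one (origin, delta) pair per axis into explicit
coordinates.  It is not Data Explorer code; the function name expand_compact
and the sample values are hypothetical, and one pair per axis is assumed.

```python
def expand_compact(pairs, shape):
    """Expand (origin, delta) pairs into explicit coordinates per axis.

    pairs: one (origin, delta) tuple per axis (the n x 2 array).
    shape: number of positions along each axis, taken from variable1.
    """
    if len(pairs) != len(shape):
        raise ValueError("need one (origin, delta) pair per axis")
    return [[origin + i * delta for i in range(count)]
            for (origin, delta), count in zip(pairs, shape)]


# Hypothetical 3 x 4 grid: axis 0 starts at 10.0 with spacing -2.0,
# axis 1 starts at 0.0 with spacing 0.5.
axes = expand_compact([(10.0, -2.0), (0.0, 0.5)], (3, 4))
print(axes)
```

Only 2n numbers are stored for an n-dimensional grid, however many positions
the grid contains.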

Positions that can be specified as the product of arrays containing the
location of points along each axis can be input in product form.  Use the
following attribute:

variable1:positions = "variable2a, product;
                       variable2b, product;
                       .
                       .
                       .
                       variable2x, product";

where the variable2's are each the name of an array containing a list of
positions along that axis.  The number of items in each array must match the
length of the corresponding axis in the original variable1 data array.

If any of the axes in a partially regular product array are actually regular,
they can be specified in "compact" form:

variable1:positions = "variable2a, product, compact;
                       variable2b, product;
                       .
                       .
                       .
                       variable2x, product";

where variable2a is the name of an origin, delta array, and the rest are
position lists as before.
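
One way to picture the product form is as the set of all sums of one term
from each axis, where each term is a position vector that is nonzero only
along its own axis.  The Python sketch below works under that assumption,
which is a reading of the convention and not Data Explorer source.  The names
product_positions and compact_terms and the sample values are hypothetical.

```python
from itertools import product


def product_positions(axis_terms):
    """Form grid positions as sums of one term from each axis.

    axis_terms: one list per axis; each entry is a position vector that
    is nonzero only along that axis.  The first axis varies slowest.
    """
    positions = []
    for combo in product(*axis_terms):
        positions.append(tuple(sum(parts) for parts in zip(*combo)))
    return positions


def compact_terms(origin, delta, count):
    """Expand an (origin, delta) vector pair into `count` terms."""
    return [tuple(o + i * d for o, d in zip(origin, delta))
            for i in range(count)]


# Hypothetical 2 x 3 grid: a regular x axis given compactly and an
# irregular y axis given as an explicit list.
x_terms = compact_terms((1.0, 0.0), (0.5, 0.0), 2)
y_terms = [(0.0, 0.0), (0.0, 2.0), (0.0, 7.0)]
grid = product_positions([x_terms, y_terms])
print(grid)
```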

Connections

If the connections between positions form a regular lattice, no additional
attributes are necessary.  For 1D data, connections of "lines" are assumed; 2D
data implies "quads", 3D data implies "cubes", and for higher dimensions,
"hypercubes" are assumed.

If the connections are irregular, use one of the following attributes:

variable1:connections = "variable3, tetrahedra";

variable1:connections = "variable3, triangles";

variable1:connections = "variable3, cubes";

variable1:connections = "variable3, quads";

where variable3 is the name of an array containing a vector of point numbers,
defining each connection element item.  The length of this vector depends on
the choice of connections.  If the shape is not explicitly specified,
tetrahedra are assumed.
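
The defaults described above can be summarized in a short Python sketch.  It
is illustrative only: the function names are hypothetical, and the
vertex counts are the usual geometric ones (3 per triangle, 4 per quad or
tetrahedron, 8 per cube), not values taken from Data Explorer.

```python
# Vertices per element for the element types named above.
VERTICES = {"triangles": 3, "quads": 4, "tetrahedra": 4, "cubes": 8}

# Default element type for regular connections, by grid dimensionality.
REGULAR_DEFAULTS = {1: "lines", 2: "quads", 3: "cubes"}


def regular_connections(ndims):
    """Return the element type assumed for a regular lattice."""
    return REGULAR_DEFAULTS.get(ndims, "hypercubes")


def parse_connections(attribute):
    """Split 'variable3, shape' into (variable name, element type).

    Tetrahedra are assumed when no element type is given.
    """
    parts = [p.strip() for p in attribute.split(",")]
    name = parts[0]
    shape = parts[1] if len(parts) > 1 and parts[1] else "tetrahedra"
    if shape not in VERTICES:
        raise ValueError("unknown element type: " + shape)
    return name, shape


name, shape = parse_connections("tris, triangles")
print(name, shape, VERTICES[shape])
```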

Additional Components

If additional component information is present in the file, the following
attributes are valid:

variable1:component = "variable4, componentname, scalar;
                       variable5, componentname, vector;
                       variable6, componentname, matrix";
and

variable4:attributes = "ref, componentname;
                        dep, componentname";

SERIES DATA

The DX data model does support aggregates of data, which can be treated as a
single entity.  Such aggregates may be hierarchical or a simple flat
collection of low-level objects like a (time) series.  There are three ways to
specify the import of datasets that should be treated as series: single
variable, separate variables or separate files.

Single Variable

When all data values are defined as a single netCDF variable, and the
unlimited dimension of the variable is to be interpreted as the series
dimension, then use one of the following forms of the "field" attribute:

variable1:field = "fieldname, scalar, series";

variable1:field = "fieldname, vector, series";

variable1:field = "fieldname, matrix, series";

All other specifications are the same as for simple fields.
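
To make the "field" modifiers concrete, here is a minimal Python sketch of how
a reader might interpret them together with a variable's shape.  The function
names are hypothetical.  The sketch assumes that the series dimension comes
first, that a vector occupies one trailing dimension, and that a matrix
occupies two; the matrix layout in particular is an assumption.

```python
def parse_field(attribute):
    """Split a "field" attribute into (name, rank, is_series).

    The rank defaults to "scalar" when no modifier is present.
    """
    parts = [p.strip() for p in attribute.split(",")]
    name, modifiers = parts[0], parts[1:]
    rank = "scalar"
    is_series = False
    for word in modifiers:
        if word in ("scalar", "vector", "matrix"):
            rank = word
        elif word == "series":
            is_series = True
        else:
            raise ValueError("unknown modifier: " + word)
    return name, rank, is_series


def split_shape(shape, rank, is_series):
    """Separate a variable's shape into (series length, grid, element).

    Assumes the series dimension comes first and that a vector uses one
    trailing dimension and a matrix two (an assumption for this sketch).
    """
    shape = list(shape)
    steps = shape.pop(0) if is_series else None
    extra = {"scalar": 0, "vector": 1, "matrix": 2}[rank]
    grid = shape[:len(shape) - extra]
    element = shape[len(shape) - extra:]
    return steps, tuple(grid), tuple(element)


name, rank, is_series = parse_field("velocity, vector, series")
# Hypothetical shape: 12 time steps of a 30 x 50 x 30 grid of 3-vectors.
print(split_shape((12, 30, 50, 30, 3), rank, is_series))
```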

The position and connection information is assumed to be constant for all
members of the series and hence, is not stored redundantly.  If the positions
or connections change for each step of the series, then the variables used for
those arrays must also have an unlimited dimension that corresponds one-for-
one with the data array.  An example using this method is shown below.

Separate Variables

When there are separate netCDF variables defined for each step in the series,
but all variables are in the same file, use the following global attribute
tags:

:seriesxxx = "fieldname;
              variable1a;
              variable1b;
              .
              .
              .
              variable1x";
or

:seriesxxx = "fieldname;
              variable1a, float_value;
              variable1b, float_value;
              .
              .
              .
              variable1x, float_value";

where the global tag must have the first 6 characters "series".  Global tags
must be unique, so additional characters can be added to distinguish them.
Each variable1x is the name of the array containing the data for that step.  In the
first format, the spacing of the steps is assumed to be 1.0.  In the second
format, the float_value is the value of each step.  All other specifications
are the same as for simple fields.  For example,

:series_temp = "temp; temp001; temp002; temp003; . . . ; temp999";

 or

:series_temp = "temp; temp001, 0.0; temp002, 0.3; temp003, 0.7";

Each name, tempnnn, is the name of a variable (array) containing the data for
each member of the series.

Separate Files

When there are netCDF variables in separate files which make up the steps of a
series, use the following global attribute tags:

:seriesxxx = "fieldname, files;
              filename1;
              filename2;
              .
              .
              .
              filenameN";

or

:seriesxxx = "fieldname, files;
              filename1, float_value;
              filename2, float_value;
              .
              .
              .
              filenameN, float_value";

where the global tag must have the first 6 characters "series".  Global tags
must be unique, so additional characters can be added to distinguish them.

Each filenameN is the name of the netCDF file which contains the data
variables for that step.  In the first format, the spacing of the steps is
1.0.  In the second format, the float_value is the value of each step.  All
other specifications are the same as for simple fields.

This format can be used to create short term series within a file, and then
have a series of these smaller series.  The syntax is an extension of what is
done for multiple steps being multiple variables within a file.  For example,

:series_temp = "temp, files; temp_file1; temp_file2; temp_file3; . . . ;
                temp_fileN";

or

:series_temp = "temp, files; temp_file1, 1001.0; temp_file2, 1001.5;
                temp_file3, 1002.0; . . . ; temp_fileN, 1231.5";
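
The "series" global attributes for separate variables and for separate files
share one layout, so a single routine can read both.  The following Python
sketch is an illustration only; the function name parse_series is
hypothetical.  Where no values are given it numbers the steps 0.0, 1.0, 2.0
and so on; the spacing of 1.0 follows the text, while the starting value of
0.0 is an assumption.

```python
def parse_series(attribute):
    """Parse a "series..." global attribute.

    Returns (fieldname, uses_files, members), where members is a list of
    (name, series value) pairs.  Without explicit values the steps are
    numbered 0.0, 1.0, 2.0, ... (the starting value is an assumption).
    """
    entries = [e.strip() for e in attribute.split(";") if e.strip()]
    head = [p.strip() for p in entries[0].split(",")]
    fieldname = head[0]
    uses_files = "files" in head[1:]
    members = []
    for index, entry in enumerate(entries[1:]):
        parts = [p.strip() for p in entry.split(",")]
        value = float(parts[1]) if len(parts) > 1 else float(index)
        members.append((parts[0], value))
    return fieldname, uses_files, members


result = parse_series("temp; temp001, 0.0; temp002, 0.3; temp003, 0.7")
print(result)
```

The sketch expects a fully written attribute; the ". . ." placeholders used
in the examples above stand for omitted entries and are not handled.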


Compact Specifications of Regular Dimensions

This example describes a single two-dimensional scalar field on a latitude-
longitude, regular, rectangular grid. The example data are temperature on a
one-degree grid with global coverage.  For regular dimensions, storing all the
grid locations is redundant and wasteful of storage, even if you use a product
notation that netCDF can handle.  Because Data Explorer array objects can be
specified compactly, you can use this method to specify a netCDF with regular
dimensions efficiently.  For each dimension, you need to specify its value at
the origin and its spacing along the dimension.

In this example, two variable attributes are defined for the netCDF variables.
"field" specifies the rank of the field parameter, and "positions" specifies
where the information containing the locations of the data in space is
stored.

dimensions:
       lon = 360;
       lat = 180;
       naxes = 2;
       ndeltas = 2;

variables:
       float locations(naxes, ndeltas);
       float temperature(lat, lon);
       temperature:field = "temperature, scalar";
       temperature:positions = "locations, regular";

data:
       locations = 89.5, -1.,      // compact specification, origin and
                   -179, 1.;       // spacing for lat and lon

       temperature = ... ;         // Data for temperature

Partially Regular Grids and Time Series

This example describes an ocean circulation model, which consists of a time
series of four three-dimensional scalars (temp, sali, wata and conv) and one
three-dimensional 3-vector (vel).  NetCDF would typically require seven
variables (all scalars, with the vector stored as three scalars).
The coordinate system for the velocity vectors corresponds to that of the grid
(that is, +u implies north, +v implies east, and +w implies down).

These grids are partially regular in that the "time," "tlat," and "tlon"
portions (three out of the four dimensions) are all regularly spaced.  "time"
is to be mapped to members of a series group.  The fourth dimension, "tlvl,"
is irregularly spaced.  The compact notation can be used for the regular
dimensions, while all the values along the irregular dimension must be
specified; a product is formed from the dimensions.  The specification in
netCDL notation is:

dimensions:
       time = UNLIMITED;
       tlat = 30;
       tlon = 50;
       tlvl = 30;
       vsize = 3;       // At each grid cell for variable vel, there are
                        // three floats for the u, v, and w components of the
                        // vector field.
       naxes = 3;
       ndeltas = 2;

variables:
       float lat_axis(ndeltas, naxes);

       float lon_axis(ndeltas, naxes);

       float level_axis(tlvl, naxes);

       float temp(time, tlat, tlon, tlvl);
       temp:field = "temperature, scalar, series";
       temp:positions = "lat_axis, product, compact;
                         lon_axis, product, compact;
                         level_axis, product";

       float sali(time, tlat, tlon, tlvl);
       sali:field = "salinity, scalar, series";
       sali:positions = "lat_axis, product, compact;
                         lon_axis, product, compact;
                         level_axis, product";

       float wata(time, tlat, tlon, tlvl);
       wata:field = "water parage, scalar, series";
       wata:positions = "lat_axis, product, compact;
                         lon_axis, product, compact;
                         level_axis, product";

       float conv(time, tlat, tlon, tlvl);
       conv:field = "convective index, scalar, series";
       conv:positions = "lat_axis, product, compact;
                         lon_axis, product, compact;
                         level_axis, product";

       float vel(time, tlat, tlon, tlvl, vsize);
       vel:field = "velocity, vector, series";
       vel:positions = "lat_axis, product, compact;
                        lon_axis, product, compact;
                        level_axis, product";

data:
       lat_axis = -14.667, 0., 0.,
                    0.333, 0., 0.;

       lon_axis = 0.0, -99.8, 0.0,
                  0.0, 0.5, 0.0;

       level_axis = 0.0, 0.0, 17.5,
                    0.0, 0.0, 53.425,
                    .
                    .
                    .
                    0.0, 0.0, 5374.98;

       temp = ... ;

       sali = ... ;

       wata = ... ;

       conv = ... ;

       vel = ... ;

Irregular Surface

This example is the netCDL description of a netCDF for an irregular surface,
that of the classic teapot.  It has precomputed normals, which are imported as
the "normals" component, in addition to positions and connections.

netcdf teapot {       // name of datafile is "teapot.ncdf"
                      // name of field is "surface"
dimensions:
           pointnums = 2268;
           trinums = 3584;
           axes = 3;
           sides = 3;

variables:
           float locations(pointnums, axes);
           float normalvect(pointnums, axes);
           long tris(trinums, sides);
           float surfacedata(pointnums);

// global attributes:
                   :source = "Classic Teapot, data from Turner Whitted";

// specific attributes:
                   surfacedata:field = "surface";
                   surfacedata:connections = "tris, triangles";
                   surfacedata:positions = "locations";
                   surfacedata:component = "normalvect, normals, vector";
                   normalvect:attributes = "dep, positions";

// This is the start of a large data section

data:
   .
   .
   .
}