Attached is a PDF of the second draft of my thoughts on a point
obs convention. It's not really a proposal (yet), but a reasonable
place to start.
---
Point Observation Data
Draft 2
09/11/07
This is a convention for writing collections of point observations to
a netCDF file. It builds on section 5 of the CF-1.0 document,
replacing sections 5.4 and 5.5 with a more general convention.
A point observation is a data measurement at a specific time and
location. Each kind of measured data is placed in a data variable. The
time and location values are placed into coordinate variables and
auxiliary coordinate variables.
The starting idea, as described in section 5, is to use the
coordinates attribute to associate auxiliary coordinate variables with
the data variables. For example, consider an unconnected collection of
points where ozone has been sampled:
dimensions:
sample = 1000 ;
variables:
float O3(sample) ;
O3:long_name = "ozone concentration";
O3:units = "1e-9" ;
O3:coordinates = "lon lat z time" ;
double time(sample) ;
time:long_name = "time" ;
time:units = "days since 1970-01-01 00:00:00" ;
float lon(sample) ;
lon:long_name = "longitude" ;
lon:units = "degrees_east" ;
float lat(sample) ;
lat:long_name = "latitude" ;
lat:units = "degrees_north" ;
float z(sample) ;
z:long_name = "height above mean sea level" ;
z:units = "km" ;
z:positive = "up" ;
In this example, there are 1000 points in the collection, and we have
chosen to name the dimension sample to clarify the distinction between
collection dimensions and coordinates. The coordinates of the ith
sample are time(i), lon(i), lat(i) and z(i).
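None of the following code is part of the convention, but as a sketch of how a reader might exploit the coordinates attribute, here is one way to pull out the ith sample using the netCDF4-python library (the file name and sample index are assumptions):

from netCDF4 import Dataset

ds = Dataset("obs_points.nc")        # hypothetical file written with the layout above
o3 = ds.variables["O3"]

# The coordinates attribute names the auxiliary coordinate variables for O3.
coord_names = o3.getncattr("coordinates").split()    # ["lon", "lat", "z", "time"]

i = 17                               # an arbitrary sample index
value = o3[i]
coords = {name: ds.variables[name][i] for name in coord_names}
print(value, coords)                 # the ozone value plus its lon/lat/z/time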
When the data is time ordered, it's natural to use time as the sample
dimension:
dimensions:
time = 1000 ;
variables:
float O3(time) ;
O3:long_name = "ozone concentration";
O3:units = "1e-9" ;
O3:coordinates = "lon lat z time" ;
double time(time) ;
time:long_name = "time" ;
time:units = "days since 1970-01-01 00:00:00" ;
float lon(time) ;
lon:long_name = "longitude" ;
lon:units = "degrees_east" ;
float lat(time) ;
lat:long_name = "latitude" ;
lat:units = "degrees_north" ;
float z(time) ;
z:long_name = "height above mean sea level" ;
z:units = "km" ;
z:positive = "up" ;
Because time is now a coordinate variable, its values should be
strictly monotonic (i.e. the data is sorted by time). Formally, you no
longer need to include time in the coordinates attribute, since it is
known to be a coordinate. However, a suggested idiom is to list all
coordinates in the coordinates attribute, for clarity.
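A generic tool could check that requirement directly; a minimal sketch, assuming the file above and numpy:

import numpy as np
from netCDF4 import Dataset

ds = Dataset("obs_timeseries.nc")    # hypothetical file using time as the sample dimension
t = ds.variables["time"][:]

# Strictly monotonic means every step increases (or every step decreases).
diffs = np.diff(t)
assert np.all(diffs > 0) or np.all(diffs < 0), "time coordinate is not strictly monotonic"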
Data variables may have other dimensions. The following has a 3D wind
vector and a character array:
dimensions:
sample = 1000;
wind_vector = 3;
inst_name_strlen = 23;
variables:
float wind(sample, wind_vector);
wind:long_name = "3D wind";
wind:units = "m/s";
wind:coordinates = "lon lat z time";
char inst_name(sample, inst_name_strlen);
inst_name:long_name = "instrument name";
inst_name:coordinates = "lon lat z time" ;
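For the character array, netCDF4-python's chartostring helper collapses the string-length dimension back into ordinary strings; a sketch (the file name is an assumption):

from netCDF4 import Dataset, chartostring

ds = Dataset("obs_wind.nc")                           # hypothetical file with the layout above
names = chartostring(ds.variables["inst_name"][:])    # shape (sample,) array of strings
wind = ds.variables["wind"][:]                        # shape (sample, wind_vector)
print(names[0], wind[0])                              # instrument name and 3D wind for sample 0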
We define profile observation data as point data that has a vertical
dimension in the data, with a constant lat/lon (or x/y) location, for
example:
dimensions:
sample = 1000 ;
z = 20 ; // number of vertical levels (size chosen for illustration)
variables:
float O3(sample, z) ;
O3:long_name = "ozone concentration";
O3:units = "1e-9" ;
O3:coordinates = "lon lat z time" ;
double time(sample) ;
time:long_name = "time" ;
time:units = "days since 1970-01-01 00:00:00" ;
float lon(sample) ;
lon:long_name = "longitude" ;
lon:units = "degrees_east" ;
float lat(sample) ;
lat:long_name = "latitude" ;
lat:units = "degrees_north" ;
float z(sample, z) ;
z:long_name = "height above mean sea level" ;
z:units = "km" ;
z:positive = "up" ;
In the above example each sample has the same number of z levels,
but (possibly) different z coordinate values, creating the 2D z
coordinate. For the case where all samples have exactly the same z
coordinate values, it is more efficient, and preferable, to use:
float z(z) ;
z:long_name = "height above mean sea level" ;
z:units = "km" ;
z:positive = "up" ;
There is an important restriction on how an auxiliary coordinate
connects to the data variable: the dimensions of the auxiliary
coordinate must be a subset of the dimensions of any data variable
that uses it. So z(sample, z) and z(z) are OK as an auxiliary
coordinate for O3(sample, z), but neither could be an auxiliary
coordinate for, say, O3(time).
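To make the 1D versus 2D z distinction concrete, here is how a reader might extract a single profile under either form; a sketch, with the file name and profile index assumed:

from netCDF4 import Dataset

ds = Dataset("obs_profiles.nc")      # hypothetical file with O3(sample, z)
i = 3                                # an arbitrary profile index

o3_profile = ds.variables["O3"][i, :]

zvar = ds.variables["z"]
if zvar.ndim == 2:                   # z(sample, z): levels can differ from profile to profile
    z_profile = zvar[i, :]
else:                                # z(z): one set of levels shared by all profiles
    z_profile = zvar[:]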
Time series of station data
Suppose that point data is taken at a set of named locations called
stations. The set of observations at a particular station, if ordered
by time, becomes a time series, and the file is a collection of time
series of station data. In this case one could use:
dimensions:
station = 10 ; // measurement locations
pressure = 11 ; // pressure levels
time = UNLIMITED ;
variables:
float humidity(time, pressure, station) ;
humidity:long_name = "specific humidity" ;
humidity:units = "" ;
humidity:coordinates = "lat lon pressure time" ;
double time(time) ;
time:long_name = "time of measurement" ;
time:units = "days since 1970-01-01 00:00:00" ;
float lon(station) ;
lon:long_name = "station longitude";
lon:units = "degrees_east";
float lat(station) ;
lat:long_name = "station latitude" ;
lat:units = "degrees_north" ;
float pressure(pressure) ;
pressure:long_name = "pressure" ;
pressure:units = "hPa" ;
There are two problems with this scheme. The first is that each
station has the same number of samples (times) allocated to it. This
is called a rectangular array. When stations have different numbers of
samples, one is forced to allocate the maximum sample size and use
missing data values. In this example, the amount of wasted space is
exacerbated by having a vertical (pressure) dimension in the data.
Further, if the pressure coordinate variable can vary, one must use:
float pressure(time, pressure, station) ;
pressure:long_name = "pressure" ;
pressure:units = "hPa" ;
The second problem in this example is that the coordinate values for
time are required to be the same for each set of measurements at each
station. This can be fixed, however, by using
double time(station, time) ;
time:long_name = "time of measurement" ;
time:units = "days since 1970-01-01 00:00:00" ;
As we try to represent more complicated arrangements of point
observations, this issue of rectangular arrays often appears.
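The cost of the rectangular layout shows up as soon as one station's data is read back: a full row comes off disk and the padding has to be masked away. A sketch, assuming the layout above with a _FillValue set on humidity and a hypothetical file name:

import numpy as np
from netCDF4 import Dataset

ds = Dataset("station_rect.nc")      # hypothetical file with humidity(time, pressure, station)
s = 2                                # an arbitrary station index

h = ds.variables["humidity"][:, :, s]             # one station's (time, pressure) slab,
                                                  # masked where _FillValue padding was written
reported = (~np.ma.getmaskarray(h)).any(axis=1)   # time steps where this station has any data
print(int(reported.sum()), "of", h.shape[0], "time steps actually contain data")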
A different way to handle a variable number of samples at each station
is to remove the station dimension from the data variables and keep
track of the station index for each observation in a separate variable:
dimensions:
station = 10 ; // measurement locations
pressure = 11 ; // pressure levels
profile = UNLIMITED ;
variables:
float humidity(profile, pressure) ;
humidity:long_name = "specific humidity" ;
humidity:coordinates = "lat lon pressure time" ;
int station_index(profile) ;
station_index:long_name = "index into station dimension";
double time(profile) ;
time:long_name = "time of measurement" ;
time:units = "days since 1970-01-01 00:00:00" ;
float lon(station) ;
lon:long_name = "station longitude";
lon:units = "degrees_east";
float lat(station) ;
lat:long_name = "station latitude" ;
lat:units = "degrees_north" ;
If the pressure coordinate is constant, then
float pressure(pressure) ;
pressure:long_name = "pressure" ;
pressure:units = "hPa" ;
If the pressure coordinate can vary for each profile:
float pressure(profile, pressure) ;
pressure:long_name = "pressure" ;
pressure:units = "hPa" ;
If it's fixed for each station, you'd like to use:
float pressure(station, pressure) ;
pressure:long_name = "pressure" ;
pressure:units = "hPa" ;
The station_index variable associates the ith profile with the station
at index station_index(i). But lat and lon can no longer be considered
auxiliary coordinate variables, since they use a dimension that is not
present in the data variable. Instead, there is an extra level of
indirection represented by the station_index variable. So we are
really generalizing beyond the previous notions of coordinate variables
and auxiliary coordinate variables.
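In code, the indirection looks like this; a sketch with netCDF4-python, where the file name and the indices are assumptions:

import numpy as np
from netCDF4 import Dataset

ds = Dataset("station_profiles.nc")          # hypothetical file with the layout above
station_index = ds.variables["station_index"][:]

# The coordinates of profile i come through the index variable.
i = 7
s = int(station_index[i])
lat_i = ds.variables["lat"][s]
lon_i = ds.variables["lon"][s]

# Conversely, finding all profiles that belong to station s means scanning station_index.
profiles_of_s = np.where(station_index == s)[0]
humidity_of_s = ds.variables["humidity"][profiles_of_s, :]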
Instead of making coordinate variables more complicated, we are going
to generalize the underlying data model, using concepts from
relational databases. In addition to the fundamental data type of
multidimensional array, we add the data type table, where a table is a
collection of variables with the same outer dimension. We then define
an index join as connecting two tables using a variable in one table
that holds dimension indices into the second table. Dimension indices
are zero based.
Returning to our time series of station data example, we can create a
new notation using tables. All variables with the same outer
dimension, such as:
float humidity(profile, pressure) ;
humidity:long_name = "specific humidity" ;
float temperature(profile, pressure) ;
temperature:long_name = "air temperature" ;
float pressure(profile, pressure) ;
pressure:long_name = "pressure" ;
int station_index(profile) ;
double time(profile) ;
are rewritten as:
table {
float humidity(pressure) ;
humidity:long_name = "specific humidity" ;
float temperature(pressure) ;
temperature:long_name = "air temperature" ;
float pressure(pressure);
pressure:long_name = "pressure" ;
int station_index;
double time;
} profile (profile);
So a “table variable” is created that uses the profile (outer)
dimension. All the variables that have that outer dimension become
part of the table. Similarly for the station table (for clarity, we
stop showing the attributes):
table {
float humidity(pressure) ;
float temperature(pressure) ;
float pressure(pressure);
int station_index;
double time;
} profile(profile);
table {
float lon;
float lat;
} station (station);
To specify the index join, if we wanted to write pseudo-SQL, we could say
JOIN profile TO station WITH profile.station_index
where profile and station specify tables with the corresponding
dimension, and station_index is a variable in the profile table whose
values are indices in the station table. In other words:
JOIN <child dimension> TO <parent dimension> WITH <child.variable>
Of course, none of this is in the netCDF file; it's just a shorthand
notation.
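In array terms the join is simply a gather: the child table's index variable subscripts each variable in the parent table. A sketch using the names from the tables above (the file name is an assumption):

from netCDF4 import Dataset

ds = Dataset("station_profiles.nc")      # hypothetical file
idx = ds.variables["station_index"][:]   # profile -> station dimension indices, zero based

# "JOIN profile TO station WITH profile.station_index", materialized per profile:
lat_per_profile = ds.variables["lat"][:][idx]
lon_per_profile = ds.variables["lon"][:][idx]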
Another compact and useful notation is to consider that the tables are
nested, and to ignore the mechanism by which the nesting occurs:
table {
float lon;
float lat;
table {
double time;
float humidity(pressure) ;
float temperature(pressure) ;
float pressure(pressure);
} profile (*);
} station (station);
Here the (*) denotes a variable length dimension. All of the profiles
inside of a station table are for that station. Note that because we
are using a fixed pressure dimension, all profiles have a fixed number
of pressure levels. The values of those pressure levels can vary from
profile to profile. If the pressure levels were fixed at each station,
you would have:
table {
float lon;
float lat;
float pressure(pressure);
table {
double time;
float humidity(pressure) ;
float temperature(pressure) ;
} profile (*);
} station (station);
If the pressure levels were fixed for all profiles:
float pressure(pressure);
table {
float lon;
float lat;
table {
double time;
float humidity(pressure) ;
float temperature(pressure) ;
} profile (*);
} station (station);
If the number of pressure levels could vary from profile to profile,
we are back in the situation of having to set a maximum, then using
missing values. Applying the same principles as before we can create
another table, for example (using nested table notation):
table {
float lon;
float lat;
table {
double time;
table {
float humidity;
float temperature;
float pressure;
} obs (*);
} profile (*);
} station (station);
OR using table notation:
table {
float humidity ;
float temperature;
float pressure;
int profile_index;
} obs (obs);
table {
int station_index;
double time;
} profile (profile);
table {
float lon;
float lat;
} station (station);
OR using CDL:
float humidity(obs);
float temperature(obs);
float pressure(obs);
int profile_index(obs);
double time(profile);
int station_index(profile);
float lat(station);
float lon(station);
As you can see, there’s a mechanical conversion between these 3
notations (CDL, tables, nested tables).
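That mechanical conversion can be demonstrated in code: starting from the flat CDL form, the nested-table view is recovered by grouping on the two index variables. A sketch (netCDF4-python; the file name is an assumption):

import numpy as np
from netCDF4 import Dataset

ds = Dataset("station_profile_obs.nc")            # hypothetical file with the flat CDL layout
profile_index = ds.variables["profile_index"][:]  # obs     -> profile
station_index = ds.variables["station_index"][:]  # profile -> station

nested = []
for s in range(ds.dimensions["station"].size):
    profiles = np.where(station_index == s)[0]
    station = {"lat": float(ds.variables["lat"][s]),
               "lon": float(ds.variables["lon"][s]),
               "profiles": []}
    for p in profiles:
        obs = np.where(profile_index == p)[0]
        station["profiles"].append({
            "time": float(ds.variables["time"][p]),
            "pressure": ds.variables["pressure"][obs],
            "humidity": ds.variables["humidity"][obs],
            "temperature": ds.variables["temperature"][obs],
        })
    nested.append(station)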
Using the Unlimited Dimension
The use of the unlimited dimension in the netCDF-3 file format
warrants attention because it can have a strong effect on performance.
Consider the following example:
dimensions:
station = 4021 ; // measurement locations
pressure = 30 ; // pressure levels
name_strlen = 30 ; // string lengths chosen for illustration
desc_strlen = 80 ;
time = UNLIMITED ; // currently 117987
variables:
float humidity(time, pressure) ;
float temperature(time, pressure) ;
float pressure(time, pressure) ;
int time(time) ;
int station_index(time) ;
char name(station, name_strlen);
char desc(station, desc_strlen);
double lat(station);
double lon(station);
double alt(station);
All of the variables using the time dimension are called record
variables because they use the unlimited (record) dimension.
The layout of the netCDF-3 file format is simple: first the header is
written, then the non-record variables are each written, then the
record variables are written. Non-record variables are written in the
order they are defined. The entire space must be allocated for them at
define time, which is why their dimension sizes can't change. Record
variables are written one record at a time, where record 0 has all the
record variable values for index=0, then record 1 with all the record
variable values for index=1, etc. The unlimited dimension can thus
grow by appending to the file.
Since the file layout is quite different depending on whether the
unlimited dimension is used, the performance of reading the data can
be quite different. In a worst-case scenario, for large files, you
might see a factor of 100 performance difference, depending on your
read access pattern (the actual times are highly dependent on the
caching strategy of the underlying file system). So it is sometimes
necessary to understand what the common read pattern is and to
optimize the file layout for it.
Using the record dimension is often very useful when writing data that
arrives sequentially, since the new data can simply be appended to the
file, and you don’t need to know ahead of time how many records there
will be.
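A minimal sketch of that append pattern with netCDF4-python, using names from the point-data example (the file name and values are assumptions):

from netCDF4 import Dataset

ds = Dataset("incoming_obs.nc", "w")
ds.createDimension("time", None)            # the unlimited (record) dimension
t = ds.createVariable("time", "f8", ("time",))
o3 = ds.createVariable("O3", "f4", ("time",))

# As each observation arrives, append a new record; the total count need not be known.
def append(time_value, o3_value):
    n = len(ds.dimensions["time"])          # current number of records
    t[n] = time_value
    o3[n] = o3_value

append(12345.0, 31.2)
append(12345.1, 30.8)
ds.close()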
The decision to use the record dimension or not must not affect the
data type or the semantics of the data – only access efficiency.
Creating Fast Access to Children
Given a row in a child table, one finds the parent using the parent
index variable. However, one must read the entire parent index
variable to find all of the child rows for a given parent row. For
efficiency, one can optionally add a way to quickly find all of the
child rows for a given parent row, using a linked list or a contiguous
list.
A contiguous list places all children in contiguous rows, and then
adds firstChild and numChildren variables in the parent table which
hold dimension indices into the child table. For the ith parent row,
all its children are found at the indices between firstChild(i) and
firstChild(i) + numChildren(i). This method is recommended as the most
efficient way to read all the child rows for a parent, since they are
stored contiguously.
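Reading with the contiguous list is a single sliced access per parent; a sketch, where the file name and the humidity variable in the child table are assumptions:

from netCDF4 import Dataset

ds = Dataset("station_contiguous.nc")   # hypothetical file
first = ds.variables["firstChild"][:]   # per parent: index of its first child row
count = ds.variables["numChildren"][:]  # per parent: how many child rows follow

i = 5                                   # an arbitrary parent (e.g. station) index
start = int(first[i])
children = ds.variables["humidity"][start : start + int(count[i])]   # one contiguous read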
A forward linked list adds a firstChild variable in the parent table
and a nextChild variable in the child table, which hold dimension indices
into the child table. One reads the firstChild row and follows the
links in nextChild until the dimension index is less than 0,
indicating the end of the linked list. This method is recommended when
writing data for multiple parents at once, when the total number of
children is unknown, so a contiguous list is not possible.
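Traversal of the forward linked list is a simple loop over the index variables; a sketch, with the file name and the child-table variable humidity assumed:

from netCDF4 import Dataset

ds = Dataset("station_linked.nc")            # hypothetical file
first_child = ds.variables["firstChild"][:]  # per parent: first child row, or negative if none
next_child = ds.variables["nextChild"][:]    # per child: next child row, or negative at the end

i = 5                                        # an arbitrary parent index
rows = []
row = int(first_child[i])
while row >= 0:                              # a negative index terminates the list
    rows.append(row)
    row = int(next_child[row])

humidity = ds.variables["humidity"][:]       # read once, then gather this parent's rows
humidity_of_parent = humidity[rows]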
A backwards linked list adds a lastChild variable in the parent table
and a prevChild variable in the child table, which again hold dimension
indices into the child table. One reads the lastChild row and follows
the links in prevChild until the dimension index is less than 0,
indicating the end of the linked list. This method is recommended for
real-time data arriving serially and unpredictably, since one only has
to track the last child for each parent in memory and append the new
record, then update the lastChild array when the data has all been
received. With a forward linked list, one must also rewrite the
previous record.
Remember that dimension indices are 0 based.
Specifying the type of data
The table data type and technique of connecting tables through
dimension index variables is quite general and should be useful for
many kinds of data in any domain of science.
Experience has shown that it’s important for visualization and
analysis tools and for human understanding to classify data into broad
categories based on the topology of the collection. We call these data
types. We haven’t found a systematic or rigorous classification
scheme; rather these reflect our experience with observational
datasets in the earth sciences, strongly influenced by the type of
measuring instruments used.
While one could imagine everything as merely a collection of points,
it is usually necessary to take advantage of whatever structure is
found in the data. The structure of the data and coordinate systems
ideally reflects the connectedness (a.k.a. topology) of the
measurements. This connectedness cannot always be ascertained
by inspecting the structure of the coordinate systems. For example,
trajectories and point data have the same structure.
The set of data types we propose to standardize in the convention are:
• Collection of point data (unconnected x,y,z,t) Examples:
earthquake data.
• Collection of trajectories (connected x,y,z,t, ordered t)
Examples: aircraft data, drifting buoy.
• Collection of profiler data (unconnected x,y,t, connected z)
Examples: satellite profiles.
• Station collection of point (unconnected x,y,z, connected t)
Examples: METARs.
• Station collection of profilers (unconnected x,y; connected z,
connected t) Examples: profilers.
These mostly fit the form (Collection | Station Collection) of (Point
| Profile | Trajectory). Others that might be needed:
• Trajectories of sounding (connected x,y,z,t, ordered z, ordered
t) Examples: ship soundings.
CDL Examples
Collection of point data
variables:
float lon(obs);
float lat(obs);
float z(obs);
double time(obs);
float humidity(obs);
float temperature(obs);
float pressure(obs);
pressure:coordinates = "lon lat z time";
attributes:
:Conventions = "CF-1.1";
:CF_datatype = "Collection of point data";
:CF_datatype = "point";
:CF_table = "obs";
Collection of profiler data (rectangular)
variables:
float lon(obs);
float lat(obs);
float z(obs, z); // or z(z)
double time(obs);
float humidity(obs, z);
float temperature(obs, z);
float pressure(obs, z);
pressure:coordinates = "lon lat z time";
attributes:
:Conventions = "CF-1.1";
:CF_datatype = "Collection of profiler data";
:CF_datatype = "profiler";
:CF_table = "obs";
Collection of Trajectories
variables:
float lon(obs);
float lat(obs);
float z(obs);
double time(obs);
float humidity(obs);
float temperature(obs);
float pressure(obs);
pressure:coordinates = "lon lat z time";
int trajectory_id(obs); // unneeded if only one trajectory LOOK
attributes:
:Conventions = "CF-1.1";
:CF_datatype = "Collection of trajectory data";
:CF_datatype = "trajectory";
:CF_table = "obs";
Collection of Trajectories of Sounding (rectangular)
variables:
float lon(sounding);
float lat(sounding);
double time(sounding);
float z(sounding, z); // or z(z)
float humidity(sounding, z);
float temperature(sounding, z);
float pressure(sounding, z);
pressure:coordinates = "lon lat z time";
int trajectory_index(sounding); // unneeded if only one trajectory
char ship_name( trajectory, ship_name_strlen) ;
char instrument( trajectory, instrument_strlen) ;
attributes:
:Conventions = "CF-1.1";
:CF_datatype = "Collection of trajectory of sounding data";
:CF_datatype = "trajectory of sounding";
:CF_table = "JOIN sounding TO trajectory WITH trajectory_index";
Collection of Trajectories of Soundings (variable z)
variables:
float salinity(obs) ;
float temperature(obs) ;
float pressure(obs) ;
double time(obs) ;
int sounding_index(obs) ;
float lat(sounding) ;
float lon(sounding) ;
int trajectory_index(sounding) ;
char ship_name( trajectory, ship_name_strlen) ;
char instrument( trajectory, instrument_strlen) ;
attributes:
:Conventions = "CF-1.1";
:CF_datatype = "Collection of trajectory of sounding data";
:CF_datatype = "trajectory of sounding";
:CF_table = "JOIN sounding TO trajectory WITH trajectory_index AND JOIN obs TO sounding WITH sounding_index";
Station Collection of Point
variables:
float humidity(obs);
float temperature(obs);
float pressure(obs);
double time(obs);
int station_index(obs);
double lat(station);
double lon(station);
attributes:
:Conventions = "CF-1.1";
:CF_datatype = "Station Collection of point";
:CF_datatype = "Station";
:CF_table = "JOIN obs TO station WITH station_index";
Station Collection of Profilers (fixed length)
variables:
float humidity(profile, z);
float temperature(profile, z);
float pressure(profile, z);
double time(profile);
int station_index(profile);
double lat(station);
double lon(station);
attributes:
:Conventions = "CF-1.1";
:CF_datatype = "Station Profilers";
:CF_datatype = "Station Collection of Profiler";
:CF_table = "JOIN profile TO station WITH station_index";
Station Collection of Profilers (variable length)
variables:
float humidity(obs);
float temperature(obs);
float pressure(obs);
int profile_index(obs);
double time(profile);
int station_index(profile);
double lat(station);
double lon(station);
attributes:
:Conventions = "CF-1.1";
:CF_datatype = "Station Profilers";
:CF_datatype = "Station Collection of Profiler";
:CF_table = "JOIN profile TO station WITH station_index AND JOIN obs TO profile WITH profile_index";
Still To Do:
• Decide on the mechanism by which the join is specified. Do we
really want “pseudo-SQL”?
• Specify the datatypes globally, or ??
• Where do you put the :coordinates attribute? Putting it on all data
variables would follow existing CF, but then you have a somewhat
redundant system.
• Sorting: when can you count on it being sorted? E.g. time series in
station data. Required or optional?