Hi John,
NetCDF works quite well as a scientific data exchange format. It doesn't
work as well as a container format. The data model for in-situ
observations is much simpler if you use netCDF to represent one
profile/trajectory/sounding and bundle a collection of these
observations using some kind of archive format (could be zip or jar or
gzipped tar).
There are libraries for reading and writing directly to and from zip
archives (I've seen Java, Ruby, and Python versions), so you don't have
to zip/unzip each time you modify the archive. If you use the archive
directly, you also avoid the inode exhaustion problem that occurs (as
you mentioned) when storing a large number of small files.
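In Python, for example, the standard zipfile module can append to and
read from an archive in place (the file names here are made up):

import zipfile

# Append a new profile to an existing collection without unpacking it.
with zipfile.ZipFile("profiles.zip", mode="a") as zf:
    zf.write("profile_00042.nc")

# Read one member back into memory without extracting the rest.
with zipfile.ZipFile("profiles.zip", mode="r") as zf:
    data = zf.read("profile_00042.nc")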
I already use zip archives for storing netCDF files for our Dapper
server. For instance, I store several million netCDF files containing
profiles from the World Ocean Database in zip files that were generated
directly (no intermediate files) by a Python script. Some of the zip
files contain up to 250,000 profiles. If a profile turns out to be bad,
it's very easy to remove it from the archive. I haven't had any problems
with this scheme.
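The generation step looks roughly like the following sketch, where
profile_bytes() stands in for a real netCDF encoder writing to memory
and the names are made up:

import zipfile

# Build a collection directly, with no intermediate files on disk.
def profile_bytes(profile_id):
    return ("placeholder for netCDF profile %d" % profile_id).encode()

with zipfile.ZipFile("wod_profiles.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    for pid in range(250000):
        zf.writestr("profile_%06d.nc" % pid, profile_bytes(pid))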
Cheers, Joe
John Caron wrote:
Hi Joe, comments in line:
Joe Sirott wrote:
Hi John,
Thanks for taking the time to come up with this specification. It
looks like a good start. I do have some concerns about the complexity
of the spec, though, and would like to suggest a few changes that
might make it easier to use.
I believe that this spec is too complicated for most potential
users. For instance, it appears that any software that is able to read
these collections will have to parse a SQL-like expression in order to
interpret a collection.
Well, it's a simple syntax: "XXXX <dim_name> XXXX <dim_name> XXXX
<variable_name>". But I only threw it in to have something concrete.
One could use three separate attributes instead.
Another source of complexity is the
varying dimensionality of the dimensions and observations (either 1D
or 2D depending on the type of data).
Yes, actually I think you could probably have any number of dimensions.
Still another example is the use of character variables for storing
attribute data for collections (should software assume that any
character variable is an attribute?).
I don't understand this; do you have an example?
It's also difficult to edit data with this convention. How would I
remove an
individual profile from a collection? Or, worse, what if points needed
to be added or removed from an individual profile? I'd have to
regenerate the entire netCDF file in the latter case. That makes this
convention only practical as an archive format.
Some variants are optimal for archival, others for dynamic
modification. The backwards linked list is optimal for adding
arbitrary amounts of data efficiently, but it's pretty bad when you
read it back. My intention is to give standard options that the user
can choose depending on need.
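To see why reading is the weak point, here's a sketch of pulling one
station's records out of a backwards linked list; the variable names
(lastObs, prevObs) are illustrative, not part of the proposal:

# lastObs(station) holds the index of each station's most recent record;
# prevObs(obs) links each record to the previous one for the same
# station, with -1 marking the start.
def collect_station(last, prev_obs):
    indices = []
    i = last
    while i != -1:
        indices.append(i)
        i = prev_obs[i]
    indices.reverse()  # back into time order
    return indices

prev_obs = [-1, -1, 0, 1, 2, 3]  # two stations' records interleaved
last_obs = [4, 5]                # last record index for each station
print(collect_station(last_obs[0], prev_obs))  # [0, 2, 4]
print(collect_station(last_obs[1], prev_obs))  # [1, 3, 5]

Every record read chases an index through the record dimension, so
reads scatter across the whole file.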
If you want to throw me a use case, I'll try to give you a concrete
solution.
An alternative would be to store each individual
profile/trajectory/time series in a separate netCDF file. Collections
would consist of a set of netCDF files stored in a zip or jar
file. The zip file could also contain some sort of (XML?) manifest
file that could contain metadata about the collection as a whole. Any
metadata associated with an individual profile would be stored as a
global attribute in the appropriate netCDF file. Editing a profile
would be as simple as extracting the netCDF file from the archive,
rewriting it, and then storing it back in the jar file.
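One caveat: most zip libraries can't rewrite a member in place, so
"storing it back" really means copying the archive with one member
swapped. A Python sketch (the function and names are made up):

import shutil
import zipfile

# Replace one member of a zip archive (pass new_bytes=None to remove it)
# by copying everything else into a fresh archive, then swapping it in.
def replace_member(archive, member, new_bytes):
    tmp = archive + ".tmp"
    with zipfile.ZipFile(archive, "r") as src:
        with zipfile.ZipFile(tmp, "w", zipfile.ZIP_DEFLATED) as dst:
            for item in src.infolist():
                if item.filename != member:
                    dst.writestr(item, src.read(item.filename))
            if new_bytes is not None:
                dst.writestr(member, new_bytes)
    shutil.move(tmp, archive)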
This is a good solution sometimes, but not generally. Many small files
are not optimal for large archives. We are having trouble on
motherlode right now with excessive inode consumption. Unzipping is
too costly if the data is accessed often.
To make it even easier for consumers of this data, I would also
restrict the data type of all variables to double. Also, all four
x,y,z,t coordinates
would be required.
I also lean toward requiring x,y,z,t coordinates, but others aren't so
sure. Note this is not the same as having x,y,z,t dimensions; in fact,
this is a very important part of the proposal that deserves to be
highlighted.
I'm claiming that the general way to do coordinate systems for this
kind of data looks something like:
variables:
  float lon(obs);
  float lat(obs);
  float z(obs);
  double time(obs);
  float dataVar(obs);
    dataVar:coordinates = "lon lat z time";
rather than follow gridded data conventions like COARDS and use
variations of:
  float dataVar(t, z, y, x);
I think this is what you are saying below.
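To make that concrete: software resolves the coordinates by name, not
by dimension position. A sketch using (say) the netCDF4-python module,
with the file name made up and the variable names from the CDL above:

from netCDF4 import Dataset

# Resolve the coordinate variables named by a data variable's
# "coordinates" attribute; they all share the same "obs" dimension.
with Dataset("trajectory.nc") as nc:
    data_var = nc.variables["dataVar"]
    names = data_var.getncattr("coordinates").split()
    coords = {name: nc.variables[name][:] for name in names}
    # coords["lon"], coords["lat"], coords["z"], coords["time"] line up
    # one-to-one with data_var[:]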
Some examples (from your CDL examples):
Collection of point data
------------------------
Unchanged (just one file in archive)
Collection of profile data
--------------------------
For each netCDF file:
variables:
  double lon(1);
  double lat(1);
  double z(obs);
  double time(1);
  double humidity(obs);
  double temperature(obs);
  double pressure(obs);
Collection of trajectories
--------------------------
For each netCDF file:
variables:
  double lon(obs);
  double lat(obs);
  double z(obs);
  double time(obs);
  double humidity(obs);
  double temperature(obs);
  double pressure(obs);
Station time series
-------------------
variables:
  double lon(1);
  double lat(1);
  double z(1);
  double time(obs);
  double temperature(obs);
I think this looks fine, except I also want to cover the case where
someone needs to put more than one thing in a file.
Thanks for your input.
John