Hi John,
NetCDF works quite well as a scientific data exchange format. It doesn't
work as well as a container format. The data model for in-situ
observations is much simpler if you use netCDF to represent one
profile/trajectory/sounding and bundle a collection of these
observations using some kind of archive format (could be zip or jar or
gzipped tar).
There are libraries for reading and writing directly to and from zip
archives (I've seen Java, Ruby, and Python versions), so you don't have
to zip/unzip each time you modify the archive. If you use the archive
directly, you also avoid the inode exhaustion problem that occurs (as
you mentioned) when storing a large number of small files.
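In Python, for example, the standard zipfile module can append to and
read from an archive in place (the file names here are made up):

import zipfile

# Append a new profile to an existing collection without unpacking it.
with zipfile.ZipFile("profiles.zip", mode="a") as zf:
    zf.write("profile_00042.nc")

# Read one member back into memory without extracting the rest.
with zipfile.ZipFile("profiles.zip", mode="r") as zf:
    data = zf.read("profile_00042.nc")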
I already use zip archives for storing netCDF files for our Dapper
server. For instance, I store several million netCDF files containing
profiles from the World Ocean Database in zip files that were generated
directly (no intermediate files) by a Python script. Some of the zip
files contain up to 250,000 profiles. If a profile turns out to be bad,
it's very easy to remove it from the archive. I haven't had any problems
with this scheme.
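The generation step looks roughly like the following sketch, where
profile_bytes() stands in for a real netCDF encoder writing to memory
and the names are made up:

import zipfile

# Build a collection directly, with no intermediate files on disk.
def profile_bytes(profile_id):
    return ("placeholder for netCDF profile %d" % profile_id).encode()

with zipfile.ZipFile("wod_profiles.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    for pid in range(250000):
        zf.writestr("profile_%06d.nc" % pid, profile_bytes(pid))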
Cheers, Joe
John Caron wrote:
Hi Joe, comments in line:
Joe Sirott wrote:
Hi John,
Thanks for taking the time to come up with this specification. It
looks like a good start. I do have some concerns about the complexity
of the spec, though, and would like to suggest a few changes that
might make it easier to use.
I believe that this spec is too complicated for most potential
users. For instance, it appears that any software that is able to read
these collections will have to parse a SQL-like expression in order to
interpret a collection.
Well, it's a simple syntax: "XXXX <dim_name> XXXX <dim_name> XXXX
<variable_name>". But I only threw it in to have something concrete.
One could use three separate attributes instead.
Another source of complexity is the
varying dimensionality of the dimensions and observations (either 1D
or 2D depending on the type of data).
Yes, actually I think you could probably have any number of dimensions.
Still another example is the use of character variables for storing
attribute data for collections (should software assume that any
character variable is an attribute?).
I don't understand this; do you have an example?
It's also difficult to edit data with this convention. How would I
remove an
individual profile from a collection? Or, worse, what if points needed
to be added or removed from an individual profile? I'd have to
regenerate the entire netCDF file in the latter case. That makes this
convention only practical as an archive format.
Some variants are optimal for archival, others for dynamic
modification. The backwards linked list is optimal for adding
arbitrary amounts of data efficiently, but it's pretty bad when you
read it back. My intention is to give standard options that the user
can choose depending on need.
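To see why reading is the weak point, here's a sketch of pulling one
station's records out of a backwards linked list; the variable names
(lastObs, prevObs) are illustrative, not part of the proposal:

# lastObs(station) holds the index of each station's most recent record;
# prevObs(obs) links each record to the previous one for the same
# station, with -1 marking the start.
def collect_station(last, prev_obs):
    indices = []
    i = last
    while i != -1:
        indices.append(i)
        i = prev_obs[i]
    indices.reverse()  # back into time order
    return indices

prev_obs = [-1, -1, 0, 1, 2, 3]  # two stations' records interleaved
last_obs = [4, 5]                # last record index for each station
print(collect_station(last_obs[0], prev_obs))  # [0, 2, 4]
print(collect_station(last_obs[1], prev_obs))  # [1, 3, 5]

Every record read chases an index through the record dimension, so
reads scatter across the whole file.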
If you want to throw me a use case, I'll try to give you a concrete
solution.
An alternative would be to store each individual
profile/trajectory/time series in a separate netCDF file. Collections
would consist of a set of netCDF files stored in a zip or jar
file. The zip file could also contain some sort of (XML?) manifest
file that could contain metadata about the collection as a whole. Any
metadata associated with an individual profile would be stored as a
global attribute in the appropriate netCDF file. Editing a profile
would be as simple as extracting the netCDF file from the archive,
rewriting it, and then storing it back in the jar file.
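One caveat: most zip libraries can't rewrite a member in place, so
"storing it back" really means copying the archive with one member
swapped. A Python sketch (the function and names are made up):

import shutil
import zipfile

# Replace one member of a zip archive (pass new_bytes=None to remove it)
# by copying everything else into a fresh archive, then swapping it in.
def replace_member(archive, member, new_bytes):
    tmp = archive + ".tmp"
    with zipfile.ZipFile(archive, "r") as src:
        with zipfile.ZipFile(tmp, "w", zipfile.ZIP_DEFLATED) as dst:
            for item in src.infolist():
                if item.filename != member:
                    dst.writestr(item, src.read(item.filename))
            if new_bytes is not None:
                dst.writestr(member, new_bytes)
    shutil.move(tmp, archive)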
This is a good solution sometimes, but not generally. Many small files
are not optimal for large archives. We are having trouble on
motherlode right now with excessive inode consumption. Unzipping is
too costly if the data is accessed often.
To make it even easier for consumers of this data, I would also
restrict the data type of all variables to double. Also, all four
x,y,z,t coordinates
would be required.
I also lean toward requiring x,y,z,t coordinates, but others aren't so
sure. Note this is not the same as having x,y,z,t dimensions; in fact,
this is a very important part of the proposal that deserves to be
highlighted.
I'm claiming that the general way to do coordinate systems for this
kind of data looks something like:
variables:
  float lon(obs);
  float lat(obs);
  float z(obs);
  double time(obs);
  float dataVar(obs);
    dataVar:coordinates = "lon lat z time";
rather than follow gridded data conventions like COARDS and use
variations of:
  float dataVar(t, z, y, x);
I think this is what you are saying below.
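To make that concrete: software resolves the coordinates by name, not
by dimension position. A sketch using (say) the netCDF4-python module,
with the file name made up and the variable names from the CDL above:

from netCDF4 import Dataset

# Resolve the coordinate variables named by a data variable's
# "coordinates" attribute; they all share the same "obs" dimension.
with Dataset("trajectory.nc") as nc:
    data_var = nc.variables["dataVar"]
    names = data_var.getncattr("coordinates").split()
    coords = {name: nc.variables[name][:] for name in names}
    # coords["lon"], coords["lat"], coords["z"], coords["time"] line up
    # one-to-one with data_var[:]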
Some examples (from your CDL examples):
Collection of point data
------------------------
Unchanged (just one file in archive)
Collection of profile data
--------------------------
For each netCDF file:
variables:
  double lon(1);
  double lat(1);
  double z(obs);
  double time(1);
  double humidity(obs);
  double temperature(obs);
  double pressure(obs);
Collection of trajectories
--------------------------
For each netCDF file:
variables:
  double lon(obs);
  double lat(obs);
  double z(obs);
  double time(obs);
  double humidity(obs);
  double temperature(obs);
  double pressure(obs);
Station time series
-------------------
variables:
  double lon(1);
  double lat(1);
  double z(1);
  double time(obs);
  double temperature(obs);
I think this looks fine, except I also want to cover the case where
someone needs to put more than one thing in a file.
Thanks for your input.
John