random distracting comments are below:
John Graybeal wrote:
On Aug 20, 2009, at 9:54 AM, Tom Whittaker wrote:
One of the single biggest mistakes that the meteorological community made in
defining a distribution format for realtime, streaming data was BUFR -- because
the "tables" needed to interpret the contents of the files are somewhere
else....and sometimes, end users cannot find them!
Perhaps this is a problem with the way the tables are made available,
and not simply the fact they are separate from the data stream? After
all, many image files (for example) are not described internally at all,
but no one seems to have trouble working with those images.... (I know
that's oversimplifying the difference, but it's instructive nonetheless.)
Part of the problem is indeed the "way the tables are made available": no registry of
canonical versions, mistakes in the "official WMO table" (!), and an official WMO table
that is not machine-readable (!!).
But the biggest problem is deeply part of the BUFR design: one needs the tables
to parse the BUFR message at the syntactic level. Other external tables (e.g.
GRIB) are at the semantic level, so you can still extract the numbers even if
you don't know what they mean. But BUFR requires the tables simply to parse the
message, which makes the above issues with the tables fatal, or worse: if your
tables are wrong, your reader can silently return erroneous values.
So "self-describing" on both the syntactic and semantic level == good.
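To make the syntactic point concrete, here is a toy sketch in Python. It is not
the real BUFR layout: the two-field message, the bit widths, and the function
names (pack_fields, unpack_fields) are all made up for illustration. The only
thing it shares with BUFR is the idea that field widths live in an external
table, so the byte stream cannot be split into values without it.

```python
# Toy illustration only: NOT the real BUFR encoding. The widths below stand in
# for what an external table would supply; all names here are hypothetical.

def pack_fields(values, widths):
    """Pack unsigned integers into consecutive big-endian bit fields."""
    bits = 0
    nbits = 0
    for value, width in zip(values, widths):
        if value >= (1 << width):
            raise ValueError(f"{value} does not fit in {width} bits")
        bits = (bits << width) | value
        nbits += width
    pad = (-nbits) % 8  # pad with zero bits up to a whole number of bytes
    return (bits << pad).to_bytes((nbits + pad) // 8, "big")


def unpack_fields(payload, widths):
    """Split a byte string into bit fields, trusting the supplied widths."""
    total_bits = len(payload) * 8
    as_int = int.from_bytes(payload, "big")
    fields = []
    offset = 0
    for width in widths:
        shift = total_bits - offset - width
        fields.append((as_int >> shift) & ((1 << width) - 1))
        offset += width
    return fields


# The "table" the writer used: a 12-bit field followed by a 10-bit field.
correct_table = [12, 10]
# A reader holding a wrong or outdated table: same fields, widths swapped.
wrong_table = [10, 12]

message = pack_fields([2950, 512], correct_table)

good = unpack_fields(message, correct_table)  # recovers the original values
bad = unpack_fields(message, wrong_table)     # no exception, just wrong numbers
```

With the wrong table the reader raises no error; it just hands back different
numbers. A format that carries its own field layout would at least let you
pull the numbers out correctly, even if you then had to look up what they mean.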
NetCDF and ncML maintain the essential metadata within the files:
types, units, coordinates -- and I strongly urge you (or whomever) not
to make the "BUFR mistake" again -- put the metadata into the files!
Maybe you think all the essential metadata is within the netCDF file,
but in my opinion it isn't. I often find the essential metadata,
particularly of the semantic variety, to be absent. And I know of
communities that have had significant difficulty with the provenance
(for example) within CF/netCDF files.
The generalization (point) of this observation is that different people
require different metadata, sometimes arbitrarily complex or peripheral
metadata. And I don't think you want ALL that metadata in the same file
as the data -- especially when the data may be coming not in a file, but
in a stream of records.
Yes, in contrast to my claim above for self-describing files, you are stating the "metadata
incompleteness theorem" that says "it is impossible to put all essential metadata in a
file". Proof by induction: for any given set of metadata, I can find some user who needs one
more piece of information not in that set. QED ;^{
(That's why the TDS allows arbitrary metadata annotations that can be added to
the dataset without having to rewrite it. Doesn't refute the theorem, but does
allow for solving the problem for your friends ;^)
Do not require the end user to have to have an internet connection to
simply "read" the data....
Many people download the files and then "take them along" when
traveling, for example.
Ah, in the era of linked data, or LinkedData [2] -- which will be our
era 5 years from now, if not already -- this problem will be solved,
because all will insist on having the internet connection when they are
traveling. Witness the trajectory of internet availability at scientific
conferences.
True, but still, creating a self-contained dataset is good, for reasons of
Keeping Things Simple.