Re: THREDDS/DLESE Connections slides



Brief comment on the obvious:  It is less important what the agreed
definition of a "data set" (etc. for "collection", "catalog", "directory",
etc.) is than that there BE an agreed definition. I suggest that someone
should circulate an authoritative DODS glossary before the meeting.  It could
save hours of definitional confusion. (Personally I like the simple
definition "In a DODS server, a dataset is something you can get a DAS and
DAP from."  Maybe this should be the def'n of a "DODS data set".)
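
To make that definition concrete, here is a minimal sketch in Python. It
relies on the DAP convention that a client appends ".das", ".dds" or ".dods"
to a dataset URL to request the attributes, the structure description, or the
data. The host name, the function names, and the caller-supplied fetch()
callable are all assumptions made up for illustration; nothing here is taken
from an actual DODS client library.

```python
def dods_request_urls(dataset_url):
    """Build the three standard DAP request URLs for one dataset URL."""
    base = dataset_url.rstrip("/")
    return {
        "das": base + ".das",    # dataset attributes
        "dds": base + ".dds",    # dataset structure description
        "data": base + ".dods",  # binary data response
    }


def looks_like_dods_dataset(dataset_url, fetch):
    """Apply the working definition: both a DAS and a DDS can be retrieved.

    fetch is a hypothetical callable: fetch(url) returns the response text,
    or None when the request fails.
    """
    urls = dods_request_urls(dataset_url)
    return fetch(urls["das"]) is not None and fetch(urls["dds"]) is not None


if __name__ == "__main__":
    # Hypothetical dataset URL, used only to show the URL pattern.
    for kind, url in dods_request_urls(
        "http://example.org/dods/sst/hatteras.nc"
    ).items():
        print(kind, url)
```

Passing fetch() in as an argument keeps the sketch independent of any HTTP
library; a real check would also have to decide what counts as a valid
response.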

John:  Any thoughts you'd care to share prior to the meeting about the
potential for a DODS web crawler ("harvester", "scanner", ... more glossary
issues) automatically to produce a single giant thematic "DODS collection" in
the THREDDS framework?
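
As a rough illustration of the grouping step such a harvester would need,
here is a sketch that lumps crawled granule URLs into candidate datasets.
The rule used (granules on the same host and in the same parent path belong
together) is purely an assumption for the example, as are the URLs; it is
not a description of how the DODS crawler actually groups files.

```python
import posixpath
from collections import defaultdict
from urllib.parse import urlsplit


def group_granules(urls):
    """Group granule URLs into candidate datasets by (host, parent path)."""
    groups = defaultdict(list)
    for url in urls:
        parts = urlsplit(url)
        parent = posixpath.dirname(parts.path)
        groups[(parts.netloc, parent)].append(url)
    # Sort members so the result does not depend on crawl order.
    return {key: sorted(members) for key, members in groups.items()}


if __name__ == "__main__":
    # Made-up granule URLs standing in for the output of a crawl.
    crawled = [
        "http://example.org/dods/hatteras/pass_0002.hdf",
        "http://example.org/dods/hatteras/pass_0001.hdf",
        "http://example.org/dods/greatlakes/pass_0001.hdf",
    ]
    for (host, path), members in group_granules(crawled).items():
        print(host, path, len(members), "granules")
```

A thematic collection would then list one entry per group instead of one per
granule, and a person could override the grouping where the rule guesses
wrong.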

    - steve

===========================================

Peter Cornillon wrote:

> John Caron wrote:
> >
> > Peter Cornillon wrote:
> >
> > >>Just to make sure i understand your terminology:
> > >>
> > >>files = physical files
> > >>
> > >
> > > YUP
> > >
> > >
> > >>datasets = logical files we want the user to see
> > >>
> > >
> > > I don't think about datasets in a file concept. It could be a group of
> > > files, a single file,... I guess that the reason that I don't think
> > > about it that way is that the data need not be in digital form to be
> > > grouped in a data set. Beach profiles that have been collected over
> > > the past 50 years and consist of pages of numbers - monthly values of
> > > depth below mean low water at specified distances from a marker in a
> > > given direction would qualify. I suppose that your definition is
> > > correct from a computer perspective, I just don't think of it that way.
> >
> > ok, i didnt really mean to use the word "file". how about:
> >
> > "a dataset is a logical grouping of data, associated in some meaningful
> > way from the user's perspective."
>
> Yup.
>
> > In a DODS server, a dataset is something you can get a DAS and DAP from.
>
> Well not really. You can only get a DDS and DAS from a data set IF it is
> either a single file or has a description in a file server or now in the
> Aggregation Server.
>
> > in THREDDS, a "collection" is a collection of datasets, for which the above
> > definition also works just fine. so whats the difference between a dataset
> > and a collection?
>
> At URI we have a half dozen SST datasets derived from the AVHRR sensors:
> one for the area off of Cape Hatteras, another for the Great Lakes, ...
> Each has on the order of 15,000 passes in it. I assume that you would
> call the ensemble of these a collection?
>
> > this is the same issue that Benno has pointed out: in his DODS
> > server, there is no distinction between collections and datasets, because
> > the server seamlessly moves between collections, physical files, and the
> > fields in the files, presenting a uniform API of datasets with their DAP
> > and DAS.
>
> But, you would be hard pressed to aggregate the things that I call datasets
> at URI (the Hatteras one with the Great Lakes one) with your Aggregation
> Server. As I noted in my previous e-mail the actual grouping of data into a
> dataset is arbitrary, so one could call the collection of datasets at URI a
> dataset or one could refer to each one as a dataset. One could call all data
> at a site a data set, or in the extreme, all earth science data accessible
> via DODS as a dataset.
>
> > (I am not going to try to answer the question of what's the difference
> > between a catalog and a collection yet; hopefully others might have some
> > ideas)
> >
> > in THREDDS, a dataset has a URI, and is the smallest choosable thing in the
> > catalog.
>
> I think that this is pretty much what we refer to as a directory, although
> we are still working on making a single URL for each dataset described in
> the various directories.
>
> > our goal as middleware is to present the list of dataset choices to the
> > user very quickly, without having to actually contact the server. once the
> > user selects a dataset, then the user can expect some delay while a
> > connection is made to the server, and the "real" dataset metadata is
> > collected. This implies that the catalog metadata may not be exactly right
> > at all times (eg the list of available times of the dataset), which makes
> > life easier for implementors.
> >
> > >
> > >
> > >>inventory = listing of datasets
> > >>
> > >
> > > No, a listing of datasets is what I refer to as a directory (not a
> > > directory on a computer). The GCMD is an example of same. An
> > > inventory is a listing of elements in a data set, it could be a
> > > list of times for satellite images in an archive along with the
> > > physical location of the data (tape C18341 on a rack, or
> > > N861230147.hat in a computer directory on my machine) or a list
> > > of times and locations of each XBT in an XBT archive.
> >
> > so is an inventory an internal thing that the server uses to construct the
> > datasets that are visible to the outside world?
>
> I don't think so. First, it need not be internal. For a long time
> we maintained inventories of the data sets at JPL. The inventory
> is simply a list of the contents of a dataset. A dataset can
> exist without an inventory, in that the dataset is a logical
> grouping of the data. The GCMD identifies a lot of datasets
> that to the best of my knowledge do not have inventories. Well,
> in a sense they do in that they might often comprise all of the
> files in a directory on a computer, so the directory listing is
> to some extent an inventory of the data in the dataset.
>
> > >>question:
> > >>what does it mean to "group files into data sets"? like the agg server?
> > >>
> > >
> > > One might say that all images in this projection, from this satellite,
> > > processed this way form a data set. Or one could say that all images in
> > > this projection, from this suite of satellites processed this way
> > > form a data set. Or... This is the trouble with data sets, different
> > > people call different groupings of the data a data set. This caused
> > > a lot of blood letting between NASA and NOAA a number of years back.
> > > The idea is NOT to call every granule or every file in the system a
> > > data set, you know the difference between lumpers and splitters. In
> > > order for us to make progress, we have to back off a bit and look at
> > > the big picture, grouping things into data sets allows us to do that.
> > > This is exactly the problem that the DODS crawler has. When it crawls
> > > a site such as our satellite archive, it ends up with thousands of
> > > entries and the system or the person viewing the results struggles
> > > with a data overload, more information than s/he/it (humm... have
> > > to be careful with these gender neutral versions) wants or needs to
> > > locate the group of files that define the object of interest. Given
> > > that there is no precise definition for how to group files into a
> > > data set, I think that we can reduce the amount of information that
> > > we have to deal with to a reasonable view of all the data on the
> > > system without losing much if anything. The crawler is likely to group
> > > the files slightly differently in some cases than the human would, but
> > > one could probably discover this pretty quickly and steer the crawler
> > > if necessary.
> >
> > ok, this seems to be similar to the "collections" vs "datasets" issue
> > above. I think i need to hear Steve's tech presentation before I can
> > understand this any deeper.
> >
> > >
> > >>Generating "inventories of granules in data sets" makes sense in the
> > >>context of an agg server, but is there also meaning to it in the context
> > >>of a normal DODS server?
> > >>
> > >
> > > Not sure exactly what you mean here. We have file servers which are
> > > inventories of granules in data sets. Actually the terminology is a
> > > bit loose here also. The server in this case is a DODS FreeForm server.
> > > It serves a table that contains a list of URLs with the characteristic(s)
> > > that differentiate one URI from another, time in the case of our satellite
> > > archives.
> >
> > i think some of the problem is that i think of DODS narrowly as a specific
> > client/server protocol, and you include services and extensions that have
> > been built with or use that protocol.
>
> Yes! The DODS DAP is the thing that defines the low level data access
> protocol. To use it effectively one needs to add higher level constructs
> such as the file server.
>
> Peter
> --
>  Peter Cornillon
>   Graduate School of Oceanography  - Telephone: (401) 874-6283
>    University of Rhode Island      -       FAX: (401) 874-6728
>     Narragansett RI 02882  USA     -  Internet: address@hidden