Peter Cornillon wrote:
we expect that data holdings can be divided into two categories: 1) sites in which the monitoring (e.g., crawling) can be done occasionally (once a day, once an hour, once a week?), so the impact of the crawling is minimal; and 2) real-time sites that have constantly changing data. For these, we probably need a different strategy, and we are considering instrumenting the LDM as one possible solution.

But in sites that are being continuously updated, it seems to me that you need a local inventory, a file or some other way of keeping track of the contents of a data set. This is our notion of a file server or your configuration file in the Aggregation Server. This is the thing that you want to discover when searching for data sets, not all of the files (or granules or whatever) in the data set. This is what we are wrestling with in the crawler that we are looking at. In particular, I have asked Steve to look at ways of having the crawler group files into data sets automatically, then to reference the inventory for the data set rather than the entire data set, and to make the crawler capable of updating the inventory.
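As a rough illustration of the kind of grouping Steve might look at, here is a minimal sketch: a crawler buckets granules into data sets by a shared name pattern and keeps one inventory entry per data set. The file names and the grouping rule are hypothetical, not anything the prototype actually does.

```python
import re
from collections import defaultdict

# Hypothetical granule paths; a real crawler would get these from a
# directory listing or a DODS catalog rather than a literal list.
granules = [
    "sst/1999/sst.19990101.nc",
    "sst/1999/sst.19990102.nc",
    "chl/1999/chl.19990101.nc",
]

def dataset_key(path):
    """Group granules into a data set by collapsing the date portion."""
    name = path.rsplit("/", 1)[-1]
    return re.sub(r"\.\d{8}\.", ".*.", name)

inventory = defaultdict(list)   # data set -> list of member granules
for g in granules:
    inventory[dataset_key(g)].append(g)

# The inventory, not the individual granules, is what a search service
# would discover; updating it just means appending new granules.
for dataset, members in inventory.items():
    print(dataset, len(members), "granules")
```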
Just to make sure I understand your terminology:

  files = physical files
  datasets = logical files we want the user to see
  inventory = listing of datasets
  granule = ??

Question: what does it mean to "group files into data sets"? Like the agg server?
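For what it's worth, here is how I have been picturing those terms as a minimal data model; the names are mine, not anything in the servers:

```python
from dataclasses import dataclass, field

@dataclass
class Granule:
    """A single physical file on disk or behind a server."""
    path: str

@dataclass
class Dataset:
    """The logical collection the user should actually see."""
    name: str
    granules: list[Granule] = field(default_factory=list)

@dataclass
class Inventory:
    """A listing of data sets: the thing to discover when searching."""
    datasets: list[Dataset] = field(default_factory=list)
```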
Our hope is that the crawler would work locally, building the inventory locally, and could be made to run as often as you like. However, the inventory need not reside at the site containing the actual data, and the crawler could be run from a remote site as our prototype does. The point here is that there are two types of crawlers generating two types of lists: one that generates inventories of granules in data sets (generally local, and can be run as often as you like) and one that generates inventories of data sets - directories (generally run remotely, less often). Finally, I note that the inventory could be generated in other ways; for example, every time a granule is added to a data set, the inventory could automatically be updated. I really see the inventory issue as a local process. What is strange is the number of data sets that we encounter that do not have a formal inventory, and this is what gives rise to this problem.
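A sketch of the "update the inventory whenever a granule arrives" idea follows; the inventory file layout and the ingest hook are assumptions for illustration, not anything the LDM actually provides.

```python
import json
from pathlib import Path

INVENTORY = Path("inventory.json")   # hypothetical local inventory file

def add_granule(dataset: str, granule: str) -> None:
    """Append a newly arrived granule to its data set's inventory entry."""
    inv = json.loads(INVENTORY.read_text()) if INVENTORY.exists() else {}
    inv.setdefault(dataset, []).append(granule)
    INVENTORY.write_text(json.dumps(inv, indent=2))

# Called from whatever process ingests the real-time data, so the
# inventory stays current without any crawling at all:
add_granule("sst.*.nc", "sst/1999/sst.19990103.nc")
```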
Some possible terminology clarifications: We have been using the word "crawler" to mean a process that gets all of its information from the web/DODS server. So it can't see local disk files, but can be run remotely.
A process that must run locally, and can have access to whatever files exist, we have been calling a "scanner", as in disk scanner.
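To make the crawler/scanner distinction concrete, here is a sketch where the two differ only in where they get their listing, and either could feed the same inventory. The catalog URL format and directory layout are made up for the example.

```python
import urllib.request
from pathlib import Path

def scan_local(root: str) -> list[str]:
    """Scanner: runs on the data host and walks the local disk."""
    return [str(p) for p in Path(root).rglob("*.nc")]

def crawl_remote(catalog_url: str) -> list[str]:
    """Crawler: runs anywhere, sees only what the web/DODS server exposes."""
    with urllib.request.urlopen(catalog_url) as resp:
        text = resp.read().decode()
    # Naive assumption: the server returns one granule URL per line.
    return [line.strip() for line in text.splitlines() if line.strip()]
```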
Generating "inventories of granules in data sets" makes sense in the context of an agg server, but is there also meaning to it in the context of a normal DODS server?