Hi Chiara,

On 05/07/18 08:54, Chiara Scaini wrote:

> Hi Antonio, and thanks for answering. I'm using version 4.6.11.
>
> Here's an example of two folders of the filesystem containing data from two different models. I'm currently creating two different datasetScans, so the data are located in different folders in my THREDDS catalog:
>
>   /filesystem/operative/data/wrfop/outputs/2018
>   /filesystem/operative/data//ariareg/farm/air_quality_forecast/2018
>
> Each folder contains several daily folders with files that I can filter by name (e.g. <include wildcard="*myfile_*"/>):
>
>   /filesystem/operative/data//ariareg/farm/air_quality_forecast/2018/20180620_00/myfile.nc
>
> But my aim is to harvest data from THREDDS to GeoNetwork only once per file. Since GeoNetwork can take the 'harvest' attribute into account, I would like to set the harvest attribute to 'false' for all data but the newly created. Do you think that's possible with the current functions?

I don't have experience with GeoNetwork, but looking at its documentation, what you want is to harvest THREDDS catalogs *from* GeoNetwork. From my point of view, the "problem" is that GeoNetwork harvests datasets (atomic or collections) that were already harvested, and you don't want that to happen again. In fact, in the GeoNetwork example for harvesting a remote THREDDS catalog, the "harvest" attribute of the dataset is ignored. It may be beyond my knowledge, but it would be very easy for GeoNetwork to "cache" already harvested metadata based on the ID attribute of the THREDDS datasets (both collections and atomic ones).

The harvest attribute is applied only to the root of the datasetScan collection, and it is not inherited because it is not a metadata element.
> A workaround would be to create a temporary folder (and catalog) to be used for harvesting. A crontab job creates the new data in the filesystem every day, and it can create the links too. The catalog would contain symbolic links and the attribute harvest="true". The links would be deleted and replaced daily from crontab. Once the data are imported into GeoNetwork, I would of course modify the THREDDS links to point to the main catalog and not to a 404.
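For concreteness, such a temporary catalog might look roughly like the following sketch. The path, location, and names here are hypothetical, standing in for the folder of daily symbolic links that the crontab job would maintain:

  <datasetScan name="Latest WRF run" ID="latestWRF" path="latest_wrf"
               location="/filesystem/operative/harvest_links" harvest="true">
    <metadata inherited="true">
      <serviceName>all</serviceName>
    </metadata>
    <filter>
      <!-- only the files linked into the temporary folder -->
      <include wildcard="*wrfout_*"/>
    </filter>
  </datasetScan>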
That could be a solution, like the latest dataset in http://motherlode.ucar.edu/thredds/catalog/satellite/3.9/WEST-CONUS_4km/catalog.html

Maybe you could use the latest or proxy dataset feature described in [1]. It allows the datasetScan to generate a proxy dataset which points to the latest "added" dataset, so you can provide the URL of that latest dataset to the GeoNetwork harvester. I'm not sure if this solves your issue, but it's worth a try.
This is an example: http://motherlode.ucar.edu/thredds/catalog/nws/metar/ncdecoded/files/catalog.html

The other option could be to "regenerate" the catalogs dynamically and trigger a catalog reload on the TDS instance. This is quite similar to your option, but more dynamic, although it requires more machinery to complete.
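Per the proxy-dataset documentation in [1], this involves declaring a Resolver service and an addProxies element inside the datasetScan. The following is only a sketch based on that page; the service and scan names are illustrative:

  <service name="latest" serviceType="Resolver" base=""/>

  <datasetScan name="AUXILIARY" ID="testAUXILIARY" path="AUXILIARY"
               location="content/testdata/auxiliary-aux">
    <metadata inherited="true">
      <serviceName>all</serviceName>
    </metadata>
    <addProxies>
      <!-- adds a proxy dataset named "latest" at the top of the collection,
           resolved on request to the most recent dataset in the scan -->
      <simpleLatest/>
    </addProxies>
  </datasetScan>

The GeoNetwork harvester could then be pointed at the fixed URL of the "latest" proxy dataset instead of at the whole collection.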
> Here's what I got so far, with the 'harvest' attribute set at the datasetScan level. I did what you suggested about the filter:
>
>   <filter> <include wildcard="*wrfout_*"/> </filter>
>   <filter> <include collection="true"/> </filter>
>   <filter> <include atomic="true"/> </filter>
>
> The harvest attribute is not set for the inner dataset nodes, but only for the dataset parent. Is that what I should expect?
Yes, the harvest attribute is only for the parent collection dataset.
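As a side note, a datasetScan takes a single filter element holding multiple selectors, so the three separate filters above can be folded into one. A sketch based on [2] and [3] from the earlier message, which matches the files and also descends into the daily subdirectories:

  <filter>
    <!-- match the WRF output files (selectors apply to atomic datasets by default) -->
    <include wildcard="*wrfout_*"/>
    <!-- also include directories so nested files are scanned -->
    <include wildcard="*" atomic="false" collection="true"/>
  </filter>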
<catalog version="1.0.1"><service name="all" serviceType="Compound" base=""><service name="odap" serviceType="OPENDAP" base="/thredds/dodsC/"/><service name="http" serviceType="HTTPServer" base="/thredds/fileServer/"/><service name="wms" serviceType="WMS" base="/thredds/wms/"/><service name="ncss" serviceType="NetcdfSubset" base="/thredds/ncss/"/></service><dataset name="AUXILIARY" harvest="true" ID="testAUXILIARY"><metadata inherited="true"><serviceName>all</serviceName><dataType>GRID</dataType><documentation type="summary">This is a summary for my test ARPA catalog for WRF runs. Runs are made at 12Z and 00Z, with analysis an d forecasts every 6 hours out to 60 hours. Horizontal = 93 by 65 points, resolution 81.27 km, LambertConformal projection. Vertical = 1000 to 100 hPa pressure levels.</documentation><keyword>WRF outputs</keyword><geospatialCoverage><northsouth><start>25.0</start><size>35.0</size><units>degrees_north</units></northsouth><eastwest><start>-20.0</start><size>50.0</size><units>degrees_east</units></eastwest><updown><start>0.0</start><size>0.0</size><units>km</units></updown></geospatialCoverage><timeCoverage><end>present</end><duration>5 years</duration></timeCoverage><variables vocabulary="GRIB-1"/><variables vocabulary=""><variable name="Z_sfc" vocabulary_name="Geopotential H" units="gp m">Geopotential height, gpm</variable></variables></metadata><dataset name="wrfout_d03_test7" ID="testAUXILIARY/wrfout_d03_test7" urlPath="AUXILIARY/wrfout_d03_test7"><dataSize units="Mbytes">137.2</dataSize><date type="modified">2018-06-28T10:19:28Z</date></dataset><dataset name="wrfout_d03_test6" ID="testAUXILIARY/wrfout_d03_test6" urlPath="AUXILIARY/wrfout_d03_test6"><dataSize units="Mbytes">137.2</dataSize><date type="modified">2018-06-28T10:19:28Z</date></dataset></dataset></catalog>Thanks for your time, Chiara
Hope this helps. Regards, Antonio

[1] https://www.unidata.ucar.edu/software/thredds/current/tds/reference/DatasetScan.html#Adding_Proxy_Datasets
> On 4 July 2018 at 19:46, Antonio S. Cofiño <cofinoa@xxxxxxxxx> wrote:
>
>> Hi Chiara,
>>
>> I'm answering inline.
>>
>> On 04/07/18 18:23, Chiara Scaini wrote:
>>
>>> Hi all, I'm setting up a geospatial data and metadata portal based on a THREDDS catalog and the GeoNetwork engine and web application. I am working on Linux CentOS and my applications are deployed with Tomcat 8.
>>
>> Which TDS version are you using?
>>
>>> I am populating a THREDDS catalog based on a filesystem containing meteorological data. GeoNetwork then harvests the catalog and populates the application. However, given that I'm updating the data on the web side, I would like to harvest the data only once. I tried to set the 'harvest' attribute from the catalog, but without success. Here's an excerpt of my catalog.xml file:
>>
>> The "harvest" has been defined only as an attribute for dataset (and datasetScan) elements, and IMO it's not for the purpose you are looking for (see [1]).
>>
>>> <datasetScan name="AUXILIARY" ID="testAUXILIARY" path="AUXILIARY"
>>>              location="content/testdata/auxiliary-aux" harvest="true">
>>
>> This harvest is correct.
>>
>>>   <metadata inherited="true">
>>>     <serviceName>all</serviceName>
>>>     <dataType>Grid</dataType>
>>>     <dataFormatType>NetCDF</dataFormatType>
>>>     <DatasetType harvest="true"></DatasetType>
>>>     <harvest>true</harvest>
>>
>> This harvest is not defined in the THREDDS Client Catalog Specification (see [1]).
>>
>>>     <keyword>WRF outputs</keyword>
>>>     <documentation type="summary">This is a summary for my test ARPA catalog for WRF runs. Runs are made at 12Z and 00Z, with analysis and forecasts every 6 hours out to 60 hours. Horizontal = 93 by 65 points, resolution 81.27 km, LambertConformal projection. Vertical = 1000 to 100 hPa pressure levels.</documentation>
>>>     <timeCoverage>
>>>       <end>present</end>
>>>       <duration>5 years</duration>
>>>     </timeCoverage>
>>>     <variables vocabulary="GRIB-1" />
>>>     <variables vocabulary="">
>>>       <variable name="Z_sfc" vocabulary_name="Geopotential H" units="gp m">Geopotential height, gpm</variable>
>>>     </variables>
>>>   </metadata>
>>>   <filter>
>>>     <include wildcard="*wrfout_*"/>
>>>   </filter>
>>
>> How are the files distributed on disk? Are they under directories? If yes, then you need to add an include filter with the collection attribute set to "true" (see [2] and [3]).
>>
>>>   <addDatasetSize/>
>>>   <addTimeCoverage
>>>       datasetNameMatchPattern="([0-9]{2})_([0-9]{4})-([0-9]{2})-([0-9]{2})_([0-9]{2}):([0-9]{2}):([0-9]{2})"
>>>       startTimeSubstitutionPattern="$2-$3-$4T$5:00:00"
>>>       duration="6 hours" />
>>>   <namer>
>>>     <regExpOnName regExp="([0-9]{4})([0-9]{2})([0-9]{2})_([0-9]{2})"
>>>                   replaceString="WRF $1-$2-$3T$4:00:00" />
>>>     <regExpOnName regExp="([0-9]{2})_([0-9]{4})-([0-9]{2})-([0-9]{2})_([0-9]{2}):([0-9]{2}):([0-9]{2})"
>>>                   replaceString="WRF Domain-$1 $2-$3-$4T$5:00:00" />
>>>   </namer>
>>> </datasetScan>
>>>
>>> Even if I set the harvest="true" attribute, it is not inherited by the datasets, and thus the harvester does not get the files. I can also ignore the 'harvest' attribute while harvesting, but my aim is to harvest only new files using an auxiliary catalog that contains symbolic links (and updating the THREDDS path after harvesting). Am I missing something? How would you systematically add the harvest attribute to all inner datasets in a nested filesystem? Or, would it make sense to create two catalogs using the time filter options (e.g. all up to yesterday in one catalog, and today's files in another)? Can you show me an example of usage of those filters in a datasetScan?
>>> Many thanks, Chiara
>>
>> Hope this helps. Regards, Antonio
>>
>> [1] https://www.unidata.ucar.edu/software/thredds/current/tds/catalog/InvCatalogSpec.html#dataset
>> [2] https://www.unidata.ucar.edu/software/thredds/current/tds/catalog/InvCatalogServerSpec.html#datasetScan_Element#filter_Element
>> [3] https://www.unidata.ucar.edu/software/thredds/current/tds/reference/DatasetScan.html#Including_Only_the_Desired_Files
>>
>> --
>> Antonio S. Cofiño
>> Dep. de Matemática Aplicada y Ciencias de la Computación
>> Universidad de Cantabria
>> http://www.meteo.unican.es
>
> --
> Chiara Scaini
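Pulling the thread's suggestions together, a revised version of the original datasetScan excerpt might look like the sketch below: harvest stays on the datasetScan element itself, the undefined <DatasetType> and <harvest> metadata children are dropped, and the filter also includes subdirectories. The addTimeCoverage and namer elements from the original excerpt would carry over unchanged:

  <datasetScan name="AUXILIARY" ID="testAUXILIARY" path="AUXILIARY"
               location="content/testdata/auxiliary-aux" harvest="true">
    <metadata inherited="true">
      <serviceName>all</serviceName>
      <dataType>Grid</dataType>
      <dataFormatType>NetCDF</dataFormatType>
      <!-- keyword, documentation, timeCoverage, and variables
           exactly as in the original excerpt -->
    </metadata>
    <filter>
      <!-- the WRF output files -->
      <include wildcard="*wrfout_*"/>
      <!-- descend into the daily subdirectories -->
      <include wildcard="*" atomic="false" collection="true"/>
    </filter>
    <addDatasetSize/>
  </datasetScan>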