NOTICE: This version of the NSF Unidata web site (archive.unidata.ucar.edu) is no longer being updated.
Current content can be found at unidata.ucar.edu.
To learn about what's going on, see About the Archive Site.
Hola Chiara, On 20/07/18 11:02, Chiara Scaini wrote:
Hi Antonio, actually I'm thinking that it may be a good idea to write something similar to "<addTimeCoverage datasetNameMatchPattern" that allows me to add different metadata based on a regex on the filename (which contains the date). That would allow me to use the datasetScan (which is more reliable than writing my own XML catalog) and enrich specific entries based on date. Where is the source code for addTimeCoverage and addDatasetSize? I looked for it on GitHub but could not find it.
In fact, you can enhance datasets using a user-defined class for this purpose.
The datasetScan element can contain multiple dataset enhancers via the datasetEnhancerImpl element. This example is *not* fully functional because the RegExpAndDurationTimeCoverageEnhancer doesn't implement the right constructor, but it illustrates how to use this undocumented feature:
<datasetScan name="Test DatasetEnhancerImpl" ID="testDatasetEnhancerImpl"
             path="testDatasetEnhancerImpl" location="content/testdata">
  <metadata inherited="true">
    <serviceName>all</serviceName>
    <dataType>Grid</dataType>
  </metadata>
  <filter>
    <include wildcard="*eta_211.nc"/>
  </filter>
  <datasetEnhancerImpl className="thredds.cataloggen.datasetenhancer.RegExpAndDurationTimeCoverageEnhancer">
    <parameters datasetNameMatchPattern="([0-9]{4})([0-9]{2})([0-9]{2})([0-9]{2})_eta_211.nc$"
                startTimeSubstitutionPattern="$1-$2-$3T$4:00:00"
                duration="60 hours"/>
  </datasetEnhancerImpl>
</datasetScan>

The dataset enhancer class must implement the Java interface thredds.cataloggen.DatasetEnhancer and have a class constructor with one, and only one, object as its argument. You can find some details at: https://github.com/Unidata/thredds/blob/master/cdm/src/main/java/thredds/cataloggen/DatasetEnhancer.java
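For illustration, the datasetNameMatchPattern / startTimeSubstitutionPattern pair in the example above behaves like the following Python sketch (a re-implementation for clarity, not TDS code):

```python
import re

# Re-implementation, for illustration only, of what the
# datasetNameMatchPattern / startTimeSubstitutionPattern pair does.
PATTERN = re.compile(r"([0-9]{4})([0-9]{2})([0-9]{2})([0-9]{2})_eta_211\.nc$")

def start_time_for(filename):
    """Return the ISO start time derived from the filename, or None if no match."""
    m = PATTERN.search(filename)
    if m is None:
        return None
    # "$1-$2-$3T$4:00:00" in the TDS config corresponds to groups 1-4 here.
    return "{0}-{1}-{2}T{3}:00:00".format(*m.groups())

print(start_time_for("2018031900_eta_211.nc"))  # 2018-03-19T00:00:00
```

The duration attribute ("60 hours") is then added by the enhancer on top of this computed start time.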
With respect to the addTimeCoverage source code, the corresponding class is RegExpAndDurationTimeCoverageEnhancer, which implements DatasetEnhancer, but not the right constructor.
Looking at the RegExpAndDurationTimeCoverageEnhancer class, you will find the addMetadata method where the time coverage properties for the dataset are set. An equivalent approach could be followed for the harvest property.
I don't have experience using this feature, but I mention it just in case you want to test it and share your experience.
IMPORTANT: This feature is not documented, and I would understand if Unidata's developers don't support it.
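As an aside, the date-based rule behind the harvest property discussed above (e.g. flag only datasets older than some cutoff as moved to the archive) is simple to express in a post-processing script. A hypothetical helper, with an assumed 30-day cutoff:

```python
from datetime import datetime, timedelta

def should_harvest(start_time_iso, now, max_age_days=30):
    """Hypothetical rule: flag datasets whose start time is older than max_age_days."""
    start = datetime.strptime(start_time_iso, "%Y-%m-%dT%H:%M:%S")
    return (now - start) > timedelta(days=max_age_days)

print(should_harvest("2018-03-19T00:00:00", now=datetime(2018, 7, 20)))  # True
print(should_harvest("2018-03-19T00:00:00", now=datetime(2018, 3, 25)))  # False
```

The cutoff and the rule itself are assumptions here; the real criterion would come from the archive database mentioned later in the thread.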
Also, I checked the THREDDS cache and it's empty. Do you know if the final catalog resulting from the datasetScan is stored somewhere on the server? I could wget it, but since the files are nested, I would never get the complete catalog tree... If I had the complete catalog I could modify it and add, for example, the harvest flag based on the current date. The result would be something like this (2 nested folders and the data, and the harvest flag added by a Python script):
The catalog documents are generated dynamically in memory and are not persisted to disk.
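Since the catalogs only exist per request, one workaround for the wget problem above is to walk the catalog tree over HTTP yourself, following every catalogRef link. A rough sketch (error handling omitted; the entry URL would be whatever your server exposes):

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# Namespaces used by THREDDS catalogs.
CAT_NS = "http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0"
XLINK_NS = "http://www.w3.org/1999/xlink"

def child_catalog_urls(catalog_xml, base_url):
    """Return absolute URLs of nested catalogs referenced by <catalogRef>."""
    root = ET.fromstring(catalog_xml)
    hrefs = [ref.get("{%s}href" % XLINK_NS)
             for ref in root.iter("{%s}catalogRef" % CAT_NS)]
    return [urllib.parse.urljoin(base_url, h) for h in hrefs if h]

def crawl(url, seen=None):
    """Fetch a catalog and, recursively, every catalog it references."""
    seen = {} if seen is None else seen
    if url in seen:
        return seen
    seen[url] = urllib.request.urlopen(url).read()
    for child in child_catalog_urls(seen[url], url):
        crawl(child, seen)
    return seen
```

Calling crawl() on the top-level catalog URL would return every catalog document keyed by URL, which a script can then merge or post-process.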
Antonio
<dataset name="WRF 2018" ID="testWRF/2018">
  <metadata inherited="false">
    <keyword>Parent</keyword>
  </metadata>
  <metadata inherited="true">
    <serviceName>all</serviceName>
    <dataType>GRID</dataType>
    <documentation type="summary">This is a summary for my test ARPA catalog for WRF runs. Runs are made at 12Z and 00Z, with analysis and forecasts every 6 hours out to 60 hours. Horizontal = 93 by 65 points, resolution 81.27 km, LambertConformal projection. Vertical = 1000 to 100 hPa pressure levels.</documentation>
    <keyword>WRF outputs</keyword>
    <geospatialCoverage>
      <northsouth><start>25.0</start><size>35.0</size><units>degrees_north</units></northsouth>
      <eastwest><start>-20.0</start><size>50.0</size><units>degrees_east</units></eastwest>
      <updown><start>0.0</start><size>0.0</size><units>km</units></updown>
    </geospatialCoverage>
    <timeCoverage><end>present</end><duration>5 years</duration></timeCoverage>
    <variables vocabulary="GRIB-1"/>
    <variables vocabulary="">
      <variable name="Z_sfc" vocabulary_name="Geopotential H" units="gp m">Geopotential height, gpm</variable>
    </variables>
  </metadata>
  <dataset name="WRF 2018-03-19T00:00:00" ID="testWRF/2018/20180319_00">
    <metadata inherited="false"><keyword>Parent</keyword></metadata>
    <dataset name="WRF Domain-03 2018-03-23T00:00:00" ID="testWRF/2018/20180319_00/wrfout_d03_2018-03-23_00:00:00" urlPath="WRF/2018/20180319_00/wrfout_d03_2018-03-23_00:00:00">
      <dataSize units="Mbytes">137.2</dataSize>
      <date type="modified">2018-06-28T10:27:07Z</date>
      <timeCoverage><start>2018-03-23T00:00:00</start><duration>6 hours</duration></timeCoverage>
      <keyword>Children</keyword>
    </dataset>
    <dataset name="WRF Domain-03 2018-03-20T18:00:00" ID="testWRF/2018/20180319_00/wrfout_d03_2018-03-20_18:00:00" urlPath="WRF/2018/20180319_00/wrfout_d03_2018-03-20_18:00:00">
      <dataSize units="Mbytes">137.2</dataSize>
      <date type="modified">2018-06-28T10:27:13Z</date>
      <timeCoverage><start>2018-03-20T18:00:00</start><duration>6 hours</duration></timeCoverage>
      <keyword>Children</keyword>
    </dataset>
    <dataset name="WRF Domain-02 2018-03-20T00:00:00" ID="testWRF/2018/20180319_00/wrfout_d02_2018-03-20_00:00:00" urlPath="WRF/2018/20180319_00/wrfout_d02_2018-03-20_00:00:00">
      <dataSize units="Mbytes">472.4</dataSize>
      <date type="modified">2018-06-28T10:27:01Z</date>
      <timeCoverage><start>2018-03-20T00:00:00</start><duration>6 hours</duration></timeCoverage>
      <keyword>Children</keyword>
    </dataset>
    <dataset name="WRF Domain-01 2018-03-23T00:00:00" ID="testWRF/2018/20180319_00/wrfout_d01_2018-03-23_00:00:00" urlPath="WRF/2018/20180319_00/wrfout_d01_2018-03-23_00:00:00">
      <dataSize units="Mbytes">101.9</dataSize>
      <date type="modified">2018-06-28T10:26:57Z</date>
      <timeCoverage><start>2018-03-23T00:00:00</start><duration>6 hours</duration></timeCoverage>
      <keyword>Children</keyword>
    </dataset>
    <dataset name="WRF Domain-01 2018-03-20T00:00:00" ID="testWRF/2018/20180319_00/wrfout_d01_2018-03-20_00:00:00" urlPath="WRF/2018/20180319_00/wrfout_d01_2018-03-20_00:00:00">
      <dataSize units="Mbytes">101.9</dataSize>
      <date type="modified">2018-06-28T10:27:10Z</date>
      <timeCoverage><start>2018-03-20T00:00:00</start><duration>6 hours</duration></timeCoverage>
      <keyword>Children</keyword>
      *<harvest>true</harvest>*
    </dataset>
  </dataset>
</dataset>

Many thanks,
Chiara

On 19 July 2018 at 15:51, Chiara Scaini <saetachiara@xxxxxxxxx> wrote:

Hi Antonio, thanks for answering! The easiest thing for me would be using a Python script that reads the data from the database and modifies the XML (e.g. using the lxml library, https://lxml.de/). Would namespaces be used similarly to this simple example? I just added a test node 'mycustomfield' to a THREDDS catalog dataset entry.

<catalog version="1.0.1">
  <service .....
/>
  <dataset name="WRF Domain-03 2018-03-23T00:00:00" ID="testWRF/2018/20180319_00/wrfout_d03_2018-03-23_00:00:00" urlPath="WRF/2018/20180319_00/wrfout_d03_2018-03-23_00:00:00">
    <dataSize units="Mbytes">137.2</dataSize>
    <date type="modified">2018-06-28T10:27:07Z</date>
    <timeCoverage><start>2018-03-23T00:00:00</start><duration>6 hours</duration></timeCoverage>
    <myns:mycustomfield xmlns:myns="myurl">My custom stuff</myns:mycustomfield>
  </dataset>
</catalog>

Regarding the catalog: I can't modify single datasets in catalog.xml because it only contains the datasetScan. Some metadata can be added for all entries at the datasetScan level, but other metadata is entry-specific (e.g. whether the specific file was archived or not, and when). Is it possible to enable something that writes the catalog to a temp file of some kind? What do you mean by: "catalog entries generated by datasetScan are created in-memory and they are cached/persisted (???) in a specific storage format"? If it's cached, I should be able to retrieve it somehow. As for the dynamic catalog, the documentation says 'Dynamic catalogs are generated by DatasetScan <https://www.unidata.ucar.edu/software/thredds/current/tds/reference/DatasetScan.html> elements, at the time the user request is made. These catalogs are not cached', so, if I understood correctly, I can't create a text file out of it and modify it.

Thanks,
Chiara

On 19 July 2018 at 14:47, Antonio S. Cofino <cofinoa@xxxxxxxxx> wrote:

Hola Chiara,

To "enrich" the TDS catalog you can use XML namespaces [1]. That allows you to use more than one XML schema. In fact, the TDS already uses this for the datasetScan, pointing to the scanned directories using the XLink schema. With respect to a datasetScan feature that creates a proxy for the latest atomic dataset, that would require creating a new datasetScan element. Meanwhile, you can create/modify the catalogs using an external tool.
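The namespace idea can be tried quickly in Python. Chiara mentions lxml; the same pattern also works with the standard library's xml.etree, used here so the sketch has no dependencies (the "myurl" namespace URI is the placeholder from her example):

```python
import xml.etree.ElementTree as ET

CAT_NS = "http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0"
MY_NS = "myurl"  # placeholder namespace URI from the example above

ET.register_namespace("", CAT_NS)     # keep THREDDS elements unprefixed
ET.register_namespace("myns", MY_NS)  # custom elements get the myns: prefix

catalog = ET.fromstring(
    '<catalog xmlns="%s" version="1.0.1">'
    '<dataset name="WRF Domain-03 2018-03-23T00:00:00"/>'
    '</catalog>' % CAT_NS
)

# Attach a namespaced custom element to every dataset.
for ds in catalog.iter("{%s}dataset" % CAT_NS):
    field = ET.SubElement(ds, "{%s}mycustomfield" % MY_NS)
    field.text = "My custom stuff"

serialized = ET.tostring(catalog, encoding="unicode")
print(serialized)
```

Because the custom elements live in their own namespace, a schema-aware consumer of the THREDDS namespace can ignore them, which is the point of Antonio's suggestion.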
I would recommend using a tool/library that is XML-aware, to guarantee well-formed and semantically correct XML documents, but another tool could also fit your purpose. Take into account that catalog entries generated by datasetScan are created in-memory and they are cached/persisted (???) in a specific storage format. One interesting feature in the TDS 5.0 version is dynamic catalogs, similar to a catalogScan; this has not been officially released, but the current beta version already implements it.

Antonio S. Cofino

[1] https://www.w3schools.com/xml/xml_namespaces.asp

On 19 Jul 2018 12:37, "Chiara Scaini" <saetachiara@xxxxxxxxx> wrote:

Hi all, I'm setting up a THREDDS catalog to be used by GeoNetwork. The catalog contains meteorological data, but will be enriched with other data sources (e.g. a table containing the list of records that were moved to a backup facility and are no longer available on disk, or a table containing pictures related to the files). Is it possible to enrich the XML file with other data (e.g. inserting XML nodes directly into the file) without breaking THREDDS functionality? What strategy do you recommend (e.g. a bash script to modify the XML, or ...?). Note that I'm using a <datasetScan> to recursively get all items in a nested folder structure, so I would like to modify the 'real' XML catalog that contains all the nodes (some information should be inserted at the container level, other at the data level).

Many thanks,
Chiara

--
Chiara Scaini

_______________________________________________
NOTE: All exchanges posted to Unidata maintained email lists are recorded in the Unidata inquiry tracking system and made publicly available through the web. Users who post to any of the lists we maintain are reminded to remove any personal information that they do not want to be made public.
thredds mailing list
thredds@xxxxxxxxxxxxxxxx
For list information or to unsubscribe, visit: http://www.unidata.ucar.edu/mailing_lists/

--
Chiara Scaini