Jordi,

We use Nagios (the free "Core" release that ships with most Linux distributions) to continuously trigger the "first access" penalty for any HYCOM.org datasets that have changed or whose caches have expired on our THREDDS servers (https://tds.hycom.org/thredds). We check/touch each dataset's OPeNDAP access/form page and its NCSS form, i.e., the point at which the dataset has been successfully scanned and is ready for access. Because some datasets take longer than others to index/scan, we allow up to 1200 seconds for the check to complete. The advantage of Nagios is that it keeps retrying after a critical/warning/timeout. When the dataset is finally indexed and available for use, the Nagios response is "green across the board" and takes milliseconds. Nagios is also useful for checking a lot of other server services (Tomcat, memory, CPU, etc.).

For example, our custom Nagios command and service are defined as:

  command name: long_check_http
  command line: $USER1$/check_http -H $HOSTNAME$ -w 300 -c 600 -t 1200 $ARG1$
  service name: long_check_http!-u /thredds/dodsC/GLBy0.08/expt_93.0/ssh.html

When Nagios requests this "OPeNDAP Dataset Access Form" page, it triggers (and waits for) the indexing to complete and then returns a status of OK. If a dataset still returns CRITICAL or WARNING after a few minutes, that is a sign of some other system issue, e.g., disk/NFS I/O, misconfiguration, etc.

  $ /usr/lib64/nagios/plugins/check_http -H tds.hycom.org -w 300 -c 600 -t 1200 -u /thredds/dodsC/GLBy0.08/expt_93.0/ssh.html
  HTTP OK: HTTP/1.1 200 200 - 19245 bytes in 0.053 second response time |time=0.052748s;300.000000;600.000000;0.000000 size=19245B;;;0

  https://tds.hycom.org/thredds/dodsC/GLBy0.08/expt_93.0/ssh.html

Regarding your large-dataset issue, I'd advise giving the FMRC feature of THREDDS a try, but only on the "incoming" (new/changing) parts of the dataset (a minimal sketch is included further down). We do this for the incoming forecast data (a separate folder) and then flatten/keep only the parts needed to extend our existing time series. For example, we have daily forecast runs that go from hour t000 to t180, which the FMRC can easily handle, combine, and merge, but we only copy (keep/save) the t000-t023 files so that we do not have duplicate/overlapping time indices in the main, growing dataset aggregations.

The trick is to have THREDDS scan/index only the parts that are changing. The parts that do not change should be scanned once and their cache files preserved until they expire (you specify this in the config). You could do this a number of ways. We typically do it "by year": the datasetScan touching the current/active 2023 data has recheckEvery="60 min" on its joinExisting aggregation, which THREDDS uses to decide whether enough time has passed since the last index and a re-index is needed on a Tomcat restart (note: a quick Tomcat restart/bounce is the only way we can "reliably" trigger a dataset update when new data arrives). The years older than 2023 are not being updated, so their joinExisting aggregations do not set recheckEvery.

Side note: do not use backslashes or other special characters in your dataset "ID" values, as this will produce weird cache issues (learned this the hard way). The dataset ID is global (no duplicates), and it is the key (the filename) used for creating/keeping the aggregation cache files under /var/lib/tomcat/content/thredds/cache/agg. The "urlPath" is the part used to access the dataset from the web UI and to "combine" multiple aggregations into a top-level "dataset" aggregation.
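Before the catalog examples below, here is a minimal featureCollection sketch of the FMRC approach mentioned above. This is only an illustration, not our actual config: the path, collection spec, and rescan schedule are placeholders you would adapt to your own incoming-forecast directory (see the TDS featureCollection documentation for the full set of options).

  <featureCollection name="Incoming Forecast (FMRC)" featureType="FMRC" harvest="true" path="incoming/forecast/fmrc">
    <metadata inherited="true">
      <serviceName>all</serviceName>
    </metadata>
    <!-- placeholder spec: scan only the changing "incoming" directory; run date is taken from the file name -->
    <collection spec="/data/incoming/forecast/fcst_#yyyyMMdd_HH#.nc" olderThan="5 min"/>
    <!-- re-index at startup and on a schedule instead of waiting for the first access -->
    <update startup="true" rescan="0 0/30 * * * ? *" trigger="allow"/>
    <fmrcConfig regularize="false" datasetTypes="TwoD Best Runs Files"/>
  </featureCollection>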
Coming back to the urlPath point above: each "/thredds/dodsC/GOMu0.04/expt_90.1m000/data/hindcasts/YYYY" is an independently scanned/cached dataset (an OPeNDAP object) that can be reused in another joinExisting or union aggregation. Note that union operations do not produce an aggregation cache file. Below is a hybrid example of multiple aggregations for one of our datasets. We also keep our catalogs in Puppet templates for easy updates/deployments across our multiple load-balanced Tomcat+THREDDS servers.

  <dataset name="* ALL DATA/YEARS *" ID="GOMu0.04-expt_90.1m000" urlPath="GOMu0.04/expt_90.1m000">
    <serviceName>all</serviceName>
    <netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">
      <aggregation dimName="time" type="joinExisting">
        <netcdf location="dods://tds.hycom.org/thredds/dodsC/GOMu0.04/expt_90.1m000/data/hindcasts/2019"/>
        <netcdf location="dods://tds.hycom.org/thredds/dodsC/GOMu0.04/expt_90.1m000/data/hindcasts/2020"/>
        <netcdf location="dods://tds.hycom.org/thredds/dodsC/GOMu0.04/expt_90.1m000/data/hindcasts/2021"/>
        <netcdf location="dods://tds.hycom.org/thredds/dodsC/GOMu0.04/expt_90.1m000/data/hindcasts/2022"/>
        <netcdf location="dods://tds.hycom.org/thredds/dodsC/GOMu0.04/expt_90.1m000/data/hindcasts/2023"/>
      </aggregation>
    </netcdf>
  </dataset>

  <dataset name="(2023) Hindcast Data (1-hrly)" ID="GOMu0.04-expt_90.1m000-2023" urlPath="GOMu0.04/expt_90.1m000/data/hindcasts/2023">
    <serviceName>all</serviceName>
    <netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">
      <aggregation dimName="time" type="joinExisting" recheckEvery="60 min">
        <scan location="/hycom/ftp/datasets/GOMu0.04/expt_90.1m000/data/hindcasts/2023/" suffix="*.nc" subdirs="false" />
      </aggregation>
    </netcdf>
  </dataset>

  <dataset name="(2022) Hindcast Data (1-hrly)" ID="GOMu0.04-expt_90.1m000-2022" urlPath="GOMu0.04/expt_90.1m000/data/hindcasts/2022">
    <serviceName>all</serviceName>
    <netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">
      <aggregation dimName="time" type="joinExisting">
        <scan location="/hycom/ftp/datasets/GOMu0.04/expt_90.1m000/data/hindcasts/2022/" suffix="*.nc" subdirs="false" />
      </aggregation>
    </netcdf>
  </dataset>
  ...

We break our datasets into chunks "by year" and sometimes, within a year, "by variable". For example, some datasets keep the surface variables for each time value in one file (2d) and the variables with a depth component in another file (3z); we union these (a minimal sketch of such a union is appended at the end of this note). Running one datasetScan for all the *2d.nc files of a given year and another datasetScan for the *3z.nc files in that directory will create an independent cache record for each. If you tell THREDDS not to recheck a dataset, it should honor that, depending on your AggregationCache settings in threddsConfig.xml and on whether the joinExisting aggregation has its recheckEvery defined. We also enforce a longer cache retention period via this override in our threddsConfig.xml (this might not be applicable in your case, since you need to monitor and pay closer attention to these cache files):

  <AggregationCache>
    <dir>/var/lib/tomcat/content/thredds/cache/agg/</dir>
    <scour>-1 sec</scour>
    <maxAge>999 days</maxAge>
    <cachePathPolicy>oneDirectory</cachePathPolicy>
  </AggregationCache>

After the July holiday break (post July 17th) I'd be happy to Zoom with you one-on-one to help out further if needed.
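P.S. Here is a minimal sketch of the 2d/3z union idea mentioned above. The host, IDs, and urlPaths are placeholders rather than our actual catalog; each member points at an already cached joinExisting aggregation, and, as noted above, the union itself does not get its own aggregation cache file.

  <!-- placeholder names/IDs/urlPaths/host; adapt to your own catalog layout -->
  <dataset name="(2023) Surface + Depth (union)" ID="EXAMPLE-2023-union" urlPath="EXAMPLE/2023/union">
    <serviceName>all</serviceName>
    <netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">
      <aggregation type="union">
        <!-- each member is itself a cached joinExisting aggregation (e.g., built from *2d.nc and *3z.nc scans) -->
        <netcdf location="dods://tds.example.org/thredds/dodsC/EXAMPLE/2023/2d"/>
        <netcdf location="dods://tds.example.org/thredds/dodsC/EXAMPLE/2023/3z"/>
      </aggregation>
    </netcdf>
  </dataset>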
On Mon, Jun 19, 2023 at 10:52 AM Jordi Domingo Ballesta <jordi.domingo@lobelia.earth> wrote:
>
> Dear TDS team,
>
> I would like to know if it is possible to (pre-)create the aggregation cache
> and make THREDDS load it, in order to speed up the first time a dataset is
> requested.
>
> To give a bit of context, our situation is the following:
> - We have a big archive of 265 TB of data and 5 million files, distributed in
>   roughly 1000 datasets.
> - These datasets are in NetCDF format (mostly v4, some v3).
> - We run TDS version 5.4.
> - We configured THREDDS to provide access to them via "http" and "odap"
>   services, both directly (with "datasetScan") and as aggregated datasets.
> - The configuration needs to be updated regularly (at least every day) as new
>   files come in while others are deleted.
> - We have serious performance issues with access to aggregated datasets,
>   especially the first time they are accessed.
>
> To improve that, we tried configuring the catalogs with the explicit list of
> files for each dataset, including the "ncoords" field, or even the
> "coordValue" field with the time value of each file (they are joinExisting
> aggregations based on the time dimension). That substantially improved the
> performance of the first access, but the duration is still not "acceptable"
> to the users.
>
> I tried to pre-create the cache files in the thredds/cache/aggNew/ directory
> with the same content as when they are created by THREDDS, but THREDDS seems
> to ignore them when loading and just recreates its own version again. I also
> noticed that the cache database in the thredds/cache/catalog/ directory plays
> a role as well, but I do not understand the relation between it and the
> aggregation cache files.
>
> Anyway, do you recommend any practice to improve the performance of THREDDS
> the first time a dataset is accessed? Maybe issuing a one-time request for
> the time variables of each dataset in order to force THREDDS to create and
> load the cache?
>
> Your help is much appreciated. Many thanks!
>
> Kind regards,
>
> Jordi Domingo
> Senior software engineer
> Lobelia Earth, S.L.

--
Michael McDonald
Florida State University