Re: [netcdf-java] NetCDF File and Variable Data Caching

To: Christian Ward-Garrison <cwardgar@xxxxxxxx>
Subject: Re: [netcdf-java] NetCDF File and Variable Data Caching
From: James Gardner <james.gardner@xxxxxxxx>
Date: Tue, 11 Jul 2017 13:59:56 -0500


Hi Christian,

1) Regarding variable max cache size, thank you for the (speedy) accommodation; it would likely solve my problem. However, I can't help but wonder if this is best for the library. Use of variable (non-final) statics effectively means employing globals. That's going to make it much more difficult to reason about their state at a given moment. Before, they were hardcoded and final, so their value could be relied upon. Now it will be subject to change by any thread at any time, and new values will affect all default-using Variable instances everywhere in the app, regardless of which file they are associated with.

I imagine it would be useful to be able to change max size on a per-file basis. Is there a point in the code where Variable objects are being attached to the NetcdfFile object, where the limit could be set to a value previously specified on that NetcdfFile object? That way one would have the opportunity to set a value appropriate to a particular netcdf file, without the above perils. The static default value could then remain constant, as-is.
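
To make the per-file idea concrete, here is a minimal sketch of what I have in mind. It does not use the real library: FileSketch and VarSketch are hypothetical stand-ins for NetcdfFile and Variable, and the point where the limit is handed over (addVariable) is my assumption about where such a hook might live. The default stays a constant, and each file passes its own limit to the variables attached to it.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a variable; NOT the real ucar.nc2.Variable. */
final class VarSketch {
    /** Library-wide default that stays constant (4,000 bytes, the figure quoted in this thread). */
    static final int DEFAULT_SIZE_TO_CACHE = 4000;

    final String name;
    final long sizeBytes;
    private final int sizeToCache;

    VarSketch(String name, long sizeBytes, int sizeToCache) {
        this.name = name;
        this.sizeBytes = sizeBytes;
        this.sizeToCache = sizeToCache;
    }

    /** A variable is cacheable when its data is smaller than the limit it was given. */
    boolean isCacheable() {
        return sizeBytes < sizeToCache;
    }
}

/** Hypothetical stand-in for a file object that owns a per-file limit. */
final class FileSketch {
    private final int maxSizeToCache;
    private final List<VarSketch> variables = new ArrayList<>();

    /** Files that do not ask for anything special get the constant default. */
    FileSketch() {
        this(VarSketch.DEFAULT_SIZE_TO_CACHE);
    }

    FileSketch(int maxSizeToCache) {
        this.maxSizeToCache = maxSizeToCache;
    }

    /** The limit is applied at the moment a variable is attached to the file. */
    VarSketch addVariable(String name, long sizeBytes) {
        VarSketch v = new VarSketch(name, sizeBytes, maxSizeToCache);
        variables.add(v);
        return v;
    }
}

final class PerFileLimitDemo {
    public static void main(String[] args) {
        FileSketch defaults = new FileSketch();          // uses the constant default
        FileSketch generous = new FileSketch(1_000_000); // limit chosen for this file only

        System.out.println(defaults.addVariable("t", 50_000).isCacheable()); // false
        System.out.println(generous.addVariable("t", 50_000).isCacheable()); // true
    }
}
```

Because the limit is fixed when the file object is constructed, no other thread can change it afterwards, and a value chosen for one file never leaks into another.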

Let me know what you think.

2) Regarding each NetcdfDataset having its own set of cached Variable data: it seems a consequence of this would be that in a multi-threaded environment like a web server (as in our use-case), where many NetcdfDataset objects would (by necessity) be 'checked out' (in use by various threads) at any given time, there would, after a period of use during which the caches populate, end up being multiple copies of each cached Variable in memory.

For example, at the moment we are putting each of our netcdf files entirely in memory, and as you might imagine that uses up an immense amount of memory and of course is the most simplistic 'caching' solution possible. As it turns out, we only actually use about 1/10th of the variables in each netcdf file in question. So a lazy caching solution, such as the netcdf-java lib's, should be a distinct advantage, which is why I was pursuing it, allowing us to reduce the amount of memory required to very roughly 1/10 of current.

However, since each NetcdfDataset has its own copy of the same cached Variable data, not only does it negate this advantage, it could actually require vastly more memory, since the number of request-servicing threads would easily exceed 10.

Can you again validate that this is actually the situation we are facing and that I'm not missing anything here?
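
For comparison, this is roughly the behaviour I was hoping for: one copy of each variable's data per file, shared by every thread, and loaded lazily on first use. The sketch below is plain Java and not part of netcdf-java; SharedVariableCache, its key format, and the loader callback are all hypothetical names I made up for illustration.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/** Hypothetical application-level cache; NOT a netcdf-java class. */
final class SharedVariableCache {
    private final Map<String, float[]> data = new ConcurrentHashMap<>();
    private final AtomicInteger loads = new AtomicInteger();

    /**
     * Returns the data for (filePath, varName), invoking the loader only if no
     * thread has loaded it before. ConcurrentHashMap.computeIfAbsent applies the
     * mapping function at most once per key, so concurrent callers share one copy.
     */
    float[] get(String filePath, String varName, Supplier<float[]> loader) {
        String key = filePath + "#" + varName;
        return data.computeIfAbsent(key, k -> {
            loads.incrementAndGet();
            return loader.get();
        });
    }

    /** Number of times a loader actually ran. */
    int loadCount() {
        return loads.get();
    }
}

final class SharedCacheDemo {
    public static void main(String[] args) throws InterruptedException {
        SharedVariableCache cache = new SharedVariableCache();
        Thread[] workers = new Thread[16];
        for (int i = 0; i < workers.length; i++) {
            // Each worker plays the role of one request-servicing thread.
            workers[i] = new Thread(() ->
                cache.get("forecast.nc", "temperature", () -> new float[1000]));
            workers[i].start();
        }
        for (Thread t : workers) {
            t.join();
        }
        System.out.println("loads = " + cache.loadCount()); // loads = 1
    }
}
```

With 16 threads the data is loaded once rather than 16 times. The returned array is shared, so callers would have to treat it as read-only.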


Many thanks!
James


On 07/10/2017 04:33 PM, Christian Ward-Garrison wrote:

Hi James,

> is there a central place where one can change the max size of variable to cache

There are 3, each used for a different purpose [1][2][3]. Unfortunately, none of them could be modified by the user, so I pushed a commit [4] that fixes that. I also built a new SNAPSHOT version of NetCDF-Java that includes the fix [5]. If you'd like to change those values, you should grab that version. If you're using Maven or Gradle to manage the dependencies of your program, follow these instructions [6], but pull from the "unidata-snapshots" repository.


> Can you verify whether or not this is the case?

It is indeed the case. The data are cached on the Variable objects, which are not shared among NetcdfDatasets.


Cheers,
Christian

[1] https://github.com/cwardgar/thredds/blob/a88db4af71bac2c29429540bc1e5387741be7d68/cdm/src/main/java/ucar/nc2/Variable.java#L69
[2] https://github.com/cwardgar/thredds/blob/a88db4af71bac2c29429540bc1e5387741be7d68/cdm/src/main/java/ucar/nc2/Variable.java#L70
[3] https://github.com/cwardgar/thredds/blob/a88db4af71bac2c29429540bc1e5387741be7d68/cdm/src/main/java/ucar/nc2/dataset/CoordinateAxis.java#L76
[4] https://github.com/cwardgar/thredds/commit/a88db4af71bac2c29429540bc1e5387741be7d68
[5] http://artifacts.unidata.ucar.edu/content/repositories/unidata-snapshots/edu/ucar/netcdfAll/4.6.11-SNAPSHOT/
[6] https://www.unidata.ucar.edu/software/thredds/current/netcdf-java/reference/BuildDependencies.html

On Thu, Jul 6, 2017 at 2:05 PM, James Gardner <james.gardner@xxxxxxxx> wrote:



    Hi,

    If I might pick up where my coworker left off; I need to know a
    couple of additional details about how caching is implemented in
    the netcdf-java lib.

    First, is there a central place where one can change the max size
    of variable to cache, for instance on the netcdfFile/Dataset
    object to which the variable belongs (or even some master
    location), or must one use setSizeToCache on each separate variable?

    My second question is regarding the way/location in which variable
    data is cached.
    If one is using netcdf file caching, it seems that all
    netcdfDataset objects in the cache/pool which represent a given
    actual netcdf file, each have their own cached variable data. In
    other words, cached variable data is not shared between acquired
    netcdfDataset objects.
    I believe I have seen this reflected in the output of (the cache
    status section of) getDetailInfo() when called on various such
    netcdfDataset objects during runtime; some seem to reflect the
    custom setSizeToCache changes I have made, and have cached data,
    while others do not, indicating that they have their own copies of
    variables, along with their own cache settings/thresholds.
    Can you verify whether or not this is the case?

    Cheers,
    James


        Hi Kevin,

        I've done a little research and can provide some answers to
        your questions.

        > Then, next time that NetcdfDataset.acquireDataset() is
        > called it causes the FileCache.acquireCacheOnly() to return null because the
        > cached NetcdfDataset.raf (RandomAccessFile) is null so it makes the
        > lastModified = 0.

        Prior to v4.6.5, this is indeed how caching of NetcdfDataset worked. It was
        broken. However, the commit I referenced earlier should've fixed that.

        > What does NetcdfDataset.acquireDataset() actually cache?

        It caches the actual NetcdfDataset object, which is the result of parsing a
        dataset's metadata to form a hierarchical structure and then optionally
        "enhancing" that structure. Typical enhancements include construction of
        coordinate systems. These objects are heavyweight and non-trivial to
        create, so only making them once is a huge performance win, especially if
        the dataset aggregates smaller datasets.

        > Can I avoid having to do a Variable.read() for every request?
        > Shouldn't this data be cached inside of the netcdf file.

        No, you can't avoid calling Variable.read(). However, if a variable is
        small enough, its data will be cached automatically [1]. It looks like the
        limits are 4,000 bytes for normal variables and 40,000 bytes for coordinate
        variables, though you could set different limits by calling
        Variable.setSizeToCache(). Alternatively, you could just explicitly cache
        the data yourself by calling Variable.setCachedData().
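
The size-threshold behaviour described above can be illustrated with a small stand-alone sketch. It does not call the library: CachingVarSketch is a hypothetical class, and its setSizeToCache and setCachedData methods only borrow the names mentioned in this message to mirror the idea, not the real signatures or internals.

```java
import java.util.function.Supplier;

/** Hypothetical model of a variable with threshold-based caching; NOT ucar.nc2.Variable. */
final class CachingVarSketch {
    private final long sizeBytes;
    private final Supplier<double[]> diskRead; // stands in for the expensive read from disk
    private int sizeToCache;
    private double[] cached;
    private int diskReads;

    CachingVarSketch(long sizeBytes, int sizeToCache, Supplier<double[]> diskRead) {
        this.sizeBytes = sizeBytes;
        this.sizeToCache = sizeToCache;
        this.diskRead = diskRead;
    }

    /** Raise or lower the limit below which data is kept after a read. */
    void setSizeToCache(int bytes) {
        this.sizeToCache = bytes;
    }

    /** Explicitly supply the data, bypassing the size limit. */
    void setCachedData(double[] data) {
        this.cached = data;
    }

    /** read() is still called every time; it is cheap when the data is already cached. */
    double[] read() {
        if (cached != null) {
            return cached;
        }
        diskReads++;
        double[] d = diskRead.get();
        if (sizeBytes < sizeToCache) {
            cached = d; // small enough: keep it for later calls
        }
        return d;
    }

    int diskReads() {
        return diskReads;
    }
}

final class ThresholdCacheDemo {
    public static void main(String[] args) {
        CachingVarSketch small = new CachingVarSketch(800, 4000, () -> new double[100]);
        small.read();
        small.read();
        System.out.println(small.diskReads()); // 1

        CachingVarSketch large = new CachingVarSketch(80_000, 4000, () -> new double[10_000]);
        large.read();
        large.read();
        System.out.println(large.diskReads()); // 2

        large.setSizeToCache(100_000); // raise the limit for this variable
        large.read();
        large.read();
        System.out.println(large.diskReads()); // 3
    }
}
```

The caller's code is the same in every case; only the cost of each read() changes depending on whether the data fell under the limit.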

        > Should I be using those caching options and just storing
        > those Variable objects in memory in my own cache instead.

        With the recent caching fix, you shouldn't need to hold on to the Variable
        objects yourself. NetcdfDatasets will be cached, including the Variables
        that they contain.

        > Would it be a better option to use NetcdfFile.openInMemory().

        You could try that, especially if hardware resources are no object. I'd be
        interested in the results. Actually, I'd be interested in any performance
        data you collect as you optimize your response times.

        Just be aware that opening a file using the static methods or constructors
        in NetcdfFile will mean that enhancements won't be applied to it. If you
        need coordinate systems to be built, or calculation of scale/offset/missing
        values, you need to open with NetcdfDataset.

        ----

        In your original message, you mentioned you're using
        NetcdfDataset.initNetcdfFileCache(), which caches NetcdfDataset objects.
        Another potential performance improvement may come from caching the
        underlying RandomAccessFiles, via setGlobalFileCache(). If a
        RandomAccessFile is acquired from the cache rather than recreated, this
        saves you from performing an open() system call, as well as potentially a
        seek() and fill of its buffer.

        Here [3] are the global caches we run in the TDS. You don't need to worry
        about GribCdmIndex unless you're working with GRIB files.

        Cheers,
        Christian


        [1] https://github.com/Unidata/thredds/blob/v4.6.6/cdm/src/main/java/ucar/nc2/Variable.java#L848
        [2] https://github.com/Unidata/thredds/blob/v4.6.6/cdm/src/main/java/ucar/nc2/Variable.java#L69
        [3] https://github.com/Unidata/thredds/blob/v4.6.6/tds/src/main/java/thredds/server/config/CdmInit.java#L263

        On Wed, Jun 15, 2016 at 4:31 PM, Christian Ward-Garrison <cwardgar@xxxxxxxx> wrote:

        > Hi Kevin,
        >
        > Sorry for the delay in responding (I was busy with the release of 4.6.6), but
        > I have some time to work on this issue now. A couple of questions:
        >
        > 1. What does your webapp do? It sounds like it takes a user-defined subset
        > of the data in a NetCDF file and returns it in JSON format. How similar is
        > it to our NetCDF Subset Service (example
        > <http://thredds.ucar.edu/thredds/ncss/grib/NCEP/NAM/Alaska_11km/Best/dataset.html>)?
        > 2. What version of NetCDF-Java are you using? I suspect that much of the
        > slowness you're encountering was already fixed
        > <https://github.com/cwardgar/thredds/commit/075e9a819ee10714d53b355481a7cccac88b1fb9#diff-99981060deed76f1a9ddedc4362acd7fL155>
        > in v4.6.5.
        >
        > Cheers,
        > Christian
        >
        > On Wed, Jun 8, 2016 at 4:17 PM, Kevin Off - NOAA Affiliate <
        > kevin.off@xxxxxxxx> wrote:
        >
        >> Hi all,
        >>
        >> I am trying to understand caching when it comes to the file
        >> and the actual data. The application that I am working on will provide
        >> data from 133 NetCDF files that range in size from 50 MB to 400 MB. These
        >> are weather forecast files that contain about 22 variables that we are
        >> interested in. Each variable has between 1 and 55 or so time steps as
        >> dimensions.
        >>
        >> This is a Spring web application running in an embedded Tomcat instance.
        >> All of the files on disk amount to about 22 GB of data.
        >>
        >> When I receive a request I:
        >>
        >>    1. Re-project the lat lon to the dataset's projection (Lambert
        >>    Conformal)
        >>    2. Look up the index of the data from the coordinate variables
        >>    3. Loop through every variable
        >>    4. Perform the Array a = var.read()
        >>    5. Loop through every time step and retrieve the value at the
        >>    specified point
        >>    6. Return it all in a JSON document.
        >>
        >> This application needs to be extremely fast. We will be serving thousands
        >> of requests per second (in production on a scaled system) depending on
        >> weather conditions.
        >>
        >> I have been told that hardware is not an obstacle and that I can use as
        >> much memory as I need.
        >> During my coding and debugging I have been able to achieve a response
        >> time of about 200ms - 400ms on average (this does not include any network
        >> time).
        >> As I add timers to every part of the application I find that most of the
        >> time is spent in the Variable.read() function.
        >>
        >> Here is a summary of the configuration of the app.
        >>
        >> NetcdfDataset.initNetcdfFileCache(100, 200, 0);
        >> NetcdfDataset nc = NetcdfDataset.acquireDataset(filename, null)
        >> for each coverage {
        >>   Variable v = nc.findVariable(name)
        >>   Array d = v.read()
        >>   for each time step {
        >>     value = d.read(time, y, x)
        >>   }
        >> }
        >> nc.close()
        >>
        >> I have several questions.
        >>
        >>    1. I noticed that when the NetcdfDataset.close()
        >>    function is called it detects that I am using caching and performs
        >>    releases. This causes the IOServiceProvider
        >>    (AbstractIOServiceProvider).release() to be called which closes and
        >>    nulls the RandomAccessFile. Then, next time that
        >>    NetcdfDataset.acquireDataset() is called it causes the
        >>    FileCache.acquireCacheOnly() to return null because the cached
        >>    NetcdfDataset.raf (RandomAccessFile) is null so it makes the
        >>    lastModified = 0. Am I missing something or is there no way to reuse
        >>    the NetcdfDataset after you call close()?
        >>    2. What does NetcdfDataset.acquireDataset() actually cache? Is it
        >>    just the metadata or does it actually read in the data to all of the
        >>    variables?
        >>    3. Can I avoid having to do a Variable.read() for every request?
        >>    Shouldn't this data be cached inside of the netcdf file.
        >>    4. I see that there are caching functions on the Variable object.
        >>    Should I be using those caching options and just storing those Variable
        >>    objects in memory in my own cache instead.
        >>    5. Would it be a better option to use NetcdfFile.openInMemory().
        >>
        >> I know this is a bit long-winded but I just want to make sure to explore
        >> all of my options. I have spent a lot of time stepping through the ucar
        >> library and have already learned a lot. I just need a little guidance
        >> regarding some of the more abstract caching functionality. Thanks for your
        >> help.
        >>
        >> --
        >> Kevin Off
        >> Internet Dissemination Group, Kansas City
        >> Shared Infrastructure Services Branch
        >> National Weather Service
        >> Software Engineer / Ace Info Solutions, Inc.
        >> <http://www.aceinfosolutions.com>
        >>
        >> _______________________________________________
        >> NOTE: All exchanges posted to Unidata maintained email
        lists are
        >> recorded in the Unidata inquiry tracking system and made
        publicly
        >> available through the web.  Users who post to any of the
        lists we
        >> maintain are reminded to remove any personal information
        that they
        >> do not want to be made public.
        >>
        >>
        >> netcdf-java mailing list
        >> netcdf-java@xxxxxxxxxxxxxxxx
        >> For list information or to unsubscribe, visit:
        >> http://www.unidata.ucar.edu/mailing_lists/
        <http://www.unidata.ucar.edu/mailing_lists/>
        >>
        >
        >


