
Re: [netcdfgroup] Alternate chunking specification

Howdy Dennis!

I think that what you propose is a very natural extension of the default
chunksizes when a record dimension is used. In that case, the record
dimension gets a chunksize of 1, and the other dimensions get chunksizes of
their full extent. So for dimensions time, lat, lon, each timestep, which is
a full lat-lon grid, becomes one chunk.
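For concreteness, here is a minimal sketch of that default behavior using
the existing C API. The file name, variable name, and dimension sizes are
invented for illustration, and the chunk sizes the library actually picks
depend on the netCDF library version and its default-chunking heuristics:

    /* Sketch: create a record variable without calling nc_def_var_chunking(),
     * then ask the library what chunk sizes it chose by default.
     * (Names and sizes are hypothetical; error checking omitted.) */
    #include <netcdf.h>
    #include <stdio.h>

    int main(void) {
        int ncid, dimids[3], varid, storage;
        size_t chunks[3];

        nc_create("defaults.nc", NC_NETCDF4 | NC_CLOBBER, &ncid);
        nc_def_dim(ncid, "time", NC_UNLIMITED, &dimids[0]);
        nc_def_dim(ncid, "lat", 180, &dimids[1]);
        nc_def_dim(ncid, "lon", 360, &dimids[2]);
        nc_def_var(ncid, "tas", NC_FLOAT, 3, dimids, &varid);
        nc_enddef(ncid);

        /* No explicit chunking requested: report the library defaults. */
        nc_inq_var_chunking(ncid, varid, &storage, chunks);
        printf("default chunks: %zu %zu %zu\n", chunks[0], chunks[1], chunks[2]);

        nc_close(ncid);
        return 0;
    }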

What you propose is that, instead of only the record dimension getting a
chunksize of 1, other dimensions could get a chunksize of 1 as well. So an
array of time, level, lat, and lon could still get a chunk of one lat-lon
grid, by specifying 1 for the time and level chunksizes.
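With the existing API, that layout has to be spelled out per dimension. A
minimal sketch (hypothetical names and sizes) of chunking a
(time, level, lat, lon) variable as one lat-lon grid per chunk:

    /* Sketch: explicitly request chunks of 1 x 1 x NLAT x NLON, i.e. one
     * lat-lon grid per chunk.  Names and sizes are hypothetical. */
    #include <netcdf.h>

    #define NLEV 17
    #define NLAT 180
    #define NLON 360

    int main(void) {
        int ncid, dimids[4], varid;
        size_t chunks[4] = {1, 1, NLAT, NLON};

        nc_create("grids.nc", NC_NETCDF4 | NC_CLOBBER, &ncid);
        nc_def_dim(ncid, "time", NC_UNLIMITED, &dimids[0]);
        nc_def_dim(ncid, "level", NLEV, &dimids[1]);
        nc_def_dim(ncid, "lat", NLAT, &dimids[2]);
        nc_def_dim(ncid, "lon", NLON, &dimids[3]);
        nc_def_var(ncid, "temp", NC_FLOAT, 4, dimids, &varid);

        /* A shorthand spec would presumably reduce to a call like this. */
        nc_def_var_chunking(ncid, varid, NC_CHUNKED, chunks);

        nc_close(ncid);
        return 0;
    }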

I think that is a good idea.

On a related note, many users have complained of very poor performance on
files with a chunksize of 1 in the record dimension when they are using the
data in ways other than reading one lat-lon grid at a time. This is
understandable: to get even one value at a single level, the entire lat-lon
grid must be read. So perhaps having all the other dimensions use a
chunksize of their full extent is not such a good idea.
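The access pattern that hurts is easy to show. A minimal sketch (hypothetical
file, variable, and indices) of reading a long time series at a single grid
point, which under the one-grid-per-chunk layout forces one whole lat-lon
chunk to be read, and possibly decompressed, per timestep:

    /* Sketch: read every timestep at one (level, lat, lon) point.  With
     * chunks of 1 x 1 x NLAT x NLON, each of the ntime values touches a
     * different full lat-lon chunk.  Names and indices are hypothetical. */
    #include <netcdf.h>
    #include <stdlib.h>

    int main(void) {
        int ncid, varid;
        size_t ntime = 1000;                    /* assumed record count */
        size_t start[4] = {0, 5, 90, 180};      /* one grid point */
        size_t count[4] = {ntime, 1, 1, 1};     /* all timesteps */
        float *series = malloc(ntime * sizeof *series);

        nc_open("grids.nc", NC_NOWRITE, &ncid);
        nc_inq_varid(ncid, "temp", &varid);
        nc_get_vara_float(ncid, varid, start, count, series);

        nc_close(ncid);
        free(series);
        return 0;
    }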

Keep on NetCDFing!!

Ed


On Tue, May 23, 2017 at 3:22 PM, dmh@xxxxxxxx <dmh@xxxxxxxx> wrote:

> yes
>
> On 5/22/2017 9:08 AM, Ed Hartnett wrote:
>
>> Howdy Dennis and All!
>>
>> I applaud the effort to improve user-specification of chunking. It is a
>> topic that causes confusion to many users, in my experience.
>>
>> However I'm not sure I understand your algorithm.
>>
>> If I have dimids 0, 1, 2, which have dimlens NC_UNLIMITED, 30, and 50,
>> and I indicate that I want c = 1 (0-based index, as God and K&R intended),
>> then would I get chunksizes 1, 30, 50?
>>
>> Thanks,
>> Ed Hartnett
>>
>>
>>
>> On Tue, May 16, 2017 at 4:31 PM, Dave Allured - NOAA Affiliate <
>> dave.allured@xxxxxxxx <mailto:dave.allured@xxxxxxxx>> wrote:
>>
>>     Okay, it sounds like you are NOT proposing any changes to the
>>     netcdf-4 file format, or to the existing API functions.  Good.
>>
>>     You just asked for use cases DIFFERENT THAN 1,1,...1,di,dj,...dm.
>>  Here is one.
>>
>>     My local agency's data portal has gridded data sets that are
>>     normally dimensioned (time, lat, lon) or (time, level, lat, lon).
>>  These are chunked 1,d2,d3 or 1,1,d3,d4 for normal access, which is
>>     very popular.  These use cases could be served by your proposed
>>     alternate chunk spec method.
>>
>>     However, some of our on-line applications serve long time series for
>>     single grid points.  The normal chunking schemes like 1,d2,d3 prove
>>     to be unacceptably slow for grid point time series.  "Compromise"
>>     chunking schemes were tested, and they did not seem to perform well
>>     enough.
>>
>>     So we created mirror data sets which are chunked optimally for
>>     reading single grid points, e.g. (d1,1,1).  These perform very well
>>     in live operation, and we think that the double storage is worthwhile.
>>
>>     This is almost the same use case as Heiko Klein's second one, "b)
>>     timeseries strategy".
>>
>>     --Dave A.
>>     NOAA/OAR/ESRL/PSD/CIRES
>>     Boulder, Colorado
>>
>>
>>     On Tue, May 16, 2017 at 1:56 PM, dmh@xxxxxxxx <mailto:dmh@xxxxxxxx>
>>     <dmh@xxxxxxxx <mailto:dmh@xxxxxxxx>> wrote:
>>
>>         Note that I am proposing a second way to specify chunking on a
>>         variable. I am not proposing to remove any existing functionality.
>>
>>         But let me restate my question.
>>         Question: what are some good use cases for having a chunking spec
>>         that is different than
>>              1,1,...1,di,dj,...dm
>>         where di is the full size of the ith dimension of the variable.
>>         Heiko Klein has given a couple of good use cases, and I am
>>         looking for
>>         more.
>>         =Dennis
>>
>>
>>         On 5/16/2017 1:30 PM, Dave Allured - NOAA Affiliate wrote:
>>
>>             Dennis,
>>
>>             Are you saying that the original function
>>             nc_def_var_chunking will be kept intact, and there will be a
>>             new function that will simplify chunk setting for some data
>>             scenarios?  You are not proposing any changes in the
>>             netcdf-4 file format?
>>
>>             --Dave
>>
>>
>>             On Mon, May 15, 2017 at 1:29 PM, dmh@xxxxxxxx
>>             <mailto:dmh@xxxxxxxx> <mailto:dmh@xxxxxxxx
>>             <mailto:dmh@xxxxxxxx>> <dmh@xxxxxxxx <mailto:dmh@xxxxxxxx>
>>             <mailto:dmh@xxxxxxxx <mailto:dmh@xxxxxxxx>>> wrote:
>>
>>                  I am soliciting opinions about an alternate way to
>>             specify chunking
>>                  for netcdf files. If you are not familiar with
>>             chunking, then
>>                  you probably can ignore this message.
>>
>>                  Currently, one specifies a per-dimension decomposition
>>                  that determines how the data for a variable is decomposed
>>                  into chunks. So e.g. if I have variable (pardon the
>>             shorthand notation)
>>                     x[d1=8,d2=12]
>>                  and I say d1 is chunked 4 and d2 is chunked 4, then x
>>             will be decomposed
>>                  into 6 chunks (8/4 * 12/4).
>>
>>                  I am proposing this alternate. Suppose we have
>>                       x[d1,d2,...dm]
>>                  And we specify a position 1<=c<m
>>                  Then the idea is that we create chunks of size
>>                      d(c+1) * d(c+2) *...dm
>>                  There will be d1*d2*...dc such chunks.
>>                  In other words, we split the set of dimensions at some
>>             point (c)
>>                  and create the chunks based on that split.
>>
>>                  The claim is that for many situations, the leftmost
>>             dimensions
>>                  are what we want to iterate over: e.g. time; and we
>>             then want
>>                  to read all of the rest of the data associated with
>>             that time.
>>
>>                  So, my question is: is such a style of chunking useful?
>>
>>                  If this is not clear, let me know and I will try to
>>             clarify.
>>                  =Dennis Heimbigner
>>                    Unidata
>>
>>
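For readers trying to pin down the proposal quoted above: no such interface
exists in netCDF today, but one reading of it, consistent with the
NC_UNLIMITED/30/50 exchange near the top of this thread, is a mapping along
these lines. The helper name and the interpretation of the split position
are entirely hypothetical:

    /* Hypothetical sketch only -- no such helper exists in netCDF.
     * c is taken as the 0-based index of the first dimension that keeps
     * its full extent; dimensions before it get chunk size 1. */
    #include <stdio.h>
    #include <stddef.h>

    static void split_point_chunksizes(int ndims, const size_t *dimlens,
                                       int c, size_t *chunksizes) {
        for (int i = 0; i < ndims; i++)
            chunksizes[i] = (i < c) ? 1 : dimlens[i];
    }

    int main(void) {
        /* Dimension lengths from the exchange above; the unlimited time
         * dimension is represented by a placeholder length, since it falls
         * before the split and gets chunk size 1 anyway. */
        size_t dimlens[3] = {1, 30, 50};
        size_t chunks[3];

        split_point_chunksizes(3, dimlens, 1, chunks);
        printf("chunks: %zu %zu %zu\n", chunks[0], chunks[1], chunks[2]);
        /* prints: chunks: 1 30 50 */
        return 0;
    }

The result of such a mapping could then be passed straight to the existing
nc_def_var_chunking().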