
Re: [netcdfgroup] Alternate chunking specification

Howdy Dennis!

I think that what you propose is a very natural extension of the default
chunksizes when a record dimension is used. In that case, the record
dimension gets a chunksize of 1, and the other dimensions get chunksizes of
their full extent. So for dimensions time, lat, lon, each timestep, which is
a full lat-lon grid, becomes one chunk.
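For concreteness, here is a minimal sketch of that default behavior using
the existing C API. The file name, variable name, and dimension sizes are
invented for illustration, and the chunk sizes the library actually picks
depend on the netCDF library version and its default-chunking heuristics:

    /* Sketch: create a record variable without calling nc_def_var_chunking(),
     * then ask the library what chunk sizes it chose by default.
     * (Names and sizes are hypothetical; error checking omitted.) */
    #include <netcdf.h>
    #include <stdio.h>

    int main(void) {
        int ncid, dimids[3], varid, storage;
        size_t chunks[3];

        nc_create("defaults.nc", NC_NETCDF4 | NC_CLOBBER, &ncid);
        nc_def_dim(ncid, "time", NC_UNLIMITED, &dimids[0]);
        nc_def_dim(ncid, "lat", 180, &dimids[1]);
        nc_def_dim(ncid, "lon", 360, &dimids[2]);
        nc_def_var(ncid, "tas", NC_FLOAT, 3, dimids, &varid);
        nc_enddef(ncid);

        /* No explicit chunking requested: report the library defaults. */
        nc_inq_var_chunking(ncid, varid, &storage, chunks);
        printf("default chunks: %zu %zu %zu\n", chunks[0], chunks[1], chunks[2]);

        nc_close(ncid);
        return 0;
    }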

What you propose is that, instead of only the record dimension getting a
chunksize of 1, other dimensions could get a chunksize of 1 as well. So an
array of time, level, lat, and lon could still get a chunk of one lat-lon
grid, by specifying 1 for the time and level chunksizes.
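With the existing API, that layout has to be spelled out per dimension. A
minimal sketch (hypothetical names and sizes) of chunking a
(time, level, lat, lon) variable as one lat-lon grid per chunk:

    /* Sketch: explicitly request chunks of 1 x 1 x NLAT x NLON, i.e. one
     * lat-lon grid per chunk.  Names and sizes are hypothetical. */
    #include <netcdf.h>

    #define NLEV 17
    #define NLAT 180
    #define NLON 360

    int main(void) {
        int ncid, dimids[4], varid;
        size_t chunks[4] = {1, 1, NLAT, NLON};

        nc_create("grids.nc", NC_NETCDF4 | NC_CLOBBER, &ncid);
        nc_def_dim(ncid, "time", NC_UNLIMITED, &dimids[0]);
        nc_def_dim(ncid, "level", NLEV, &dimids[1]);
        nc_def_dim(ncid, "lat", NLAT, &dimids[2]);
        nc_def_dim(ncid, "lon", NLON, &dimids[3]);
        nc_def_var(ncid, "temp", NC_FLOAT, 4, dimids, &varid);

        /* A shorthand spec would presumably reduce to a call like this. */
        nc_def_var_chunking(ncid, varid, NC_CHUNKED, chunks);

        nc_close(ncid);
        return 0;
    }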

I think that is a good idea.

On a related note, many users have complained of very poor performance on
files with a chunksize of 1 in the record dimension when they are using the
data in ways other than reading one lat-lon grid at a time. This is
understandable: to get even one value at a single level, the entire lat-lon
grid must be read. So perhaps having all the other dimensions use a
chunksize of their full extent is not such a good idea.
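The access pattern that hurts is easy to show. A minimal sketch (hypothetical
file, variable, and indices) of reading a long time series at a single grid
point, which under the one-grid-per-chunk layout forces one whole lat-lon
chunk to be read, and possibly decompressed, per timestep:

    /* Sketch: read every timestep at one (level, lat, lon) point.  With
     * chunks of 1 x 1 x NLAT x NLON, each of the ntime values touches a
     * different full lat-lon chunk.  Names and indices are hypothetical. */
    #include <netcdf.h>
    #include <stdlib.h>

    int main(void) {
        int ncid, varid;
        size_t ntime = 1000;                    /* assumed record count */
        size_t start[4] = {0, 5, 90, 180};      /* one grid point */
        size_t count[4] = {ntime, 1, 1, 1};     /* all timesteps */
        float *series = malloc(ntime * sizeof *series);

        nc_open("grids.nc", NC_NOWRITE, &ncid);
        nc_inq_varid(ncid, "temp", &varid);
        nc_get_vara_float(ncid, varid, start, count, series);

        nc_close(ncid);
        free(series);
        return 0;
    }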

Keep on NetCDFing!!

Ed


On Tue, May 23, 2017 at 3:22 PM, dmh@xxxxxxxx <dmh@xxxxxxxx> wrote:

> yes
>
> On 5/22/2017 9:08 AM, Ed Hartnett wrote:
>
>> Howdy Dennis and All!
>>
>> I applaud the effort to improve user-specification of chunking. It is a
>> topic that causes confusion to many users, in my experience.
>>
>> However I'm not sure I understand your algorithm.
>>
>> If I have dimids 0, 1, 2, which have dimlens NC_UNLIMITED, 30, and 50,
>> and I indicate that I want c = 1 (0-based index, as God and K&R intended),
>> then would I get chunksizes 1, 30, 50?
>>
>> Thanks,
>> Ed Hartnett
>>
>>
>>
>> On Tue, May 16, 2017 at 4:31 PM, Dave Allured - NOAA Affiliate <
>> dave.allured@xxxxxxxx <mailto:dave.allured@xxxxxxxx>> wrote:
>>
>>     Okay, it sounds like you are NOT proposing any changes to the
>>     netcdf-4 file format, or to the existing API functions.  Good.
>>
>>     You just asked for use cases DIFFERENT THAN 1,1,...1,di,dj,...dm.
>>  Here is one.
>>
>>     My local agency's data portal has gridded data sets that are
>>     normally dimensioned (time, lat, lon) or (time, level, lat, lon).
>>  These are chunked 1,d2,d3 or 1,1,d3,d4 for normal access, which is
>>     very popular.  These use cases could be served by your proposed
>>     alternate chunk spec method.
>>
>>     However, some of our on-line applications serve long time series for
>>     single grid points.  The normal chunking schemes like 1,d2,d3 prove
>>     to be unacceptably slow for grid point time series.  "Compromise"
>>     chunking schemes were tested, and they did not seem to perform well
>>     enough.
>>
>>     So we created mirror data sets which are chunked optimally for
>>     reading single grid points, e.g. (d1,1,1).  These perform very well
>>     in live operation, and we think that the double storage is worthwhile.
>>
>>     This is almost the same use case as Heiko Klein's second one, "b)
>>     timeseries strategy".
>>
>>     --Dave A.
>>     NOAA/OAR/ESRL/PSD/CIRES
>>     Boulder, Colorado
>>
>>
>>     On Tue, May 16, 2017 at 1:56 PM, dmh@xxxxxxxx <mailto:dmh@xxxxxxxx>
>>     <dmh@xxxxxxxx <mailto:dmh@xxxxxxxx>> wrote:
>>
>>         Note that I am proposing a second way to specify chunking on a
>>         variable. I am not proposing to remove any existing functionality.
>>
>>         But let me restate my question.
>>         Question: what are some good use cases for having a chunking spec
>>         that is different than
>>              1,1,...1,di,dj,...dm
>>         where di is the full size of the ith dimension of the variable.
>>         Heiko Klein has given a couple of good use cases, and I am
>>         looking for
>>         more.
>>         =Dennis
>>
>>
>>         On 5/16/2017 1:30 PM, Dave Allured - NOAA Affiliate wrote:
>>
>>             Dennis,
>>
>>             Are you saying that the original function
>>             nc_def_var_chunking will be kept intact, and there will be a
>>             new function that will simplify chunk setting for some data
>>             scenarios?  You are not proposing any changes in the
>>             netcdf-4 file format?
>>
>>             --Dave
>>
>>
>>             On Mon, May 15, 2017 at 1:29 PM, dmh@xxxxxxxx
>>             <mailto:dmh@xxxxxxxx> <mailto:dmh@xxxxxxxx
>>             <mailto:dmh@xxxxxxxx>> <dmh@xxxxxxxx <mailto:dmh@xxxxxxxx>
>>             <mailto:dmh@xxxxxxxx <mailto:dmh@xxxxxxxx>>> wrote:
>>
>>                  I am soliciting opinions about an alternate way to
>>             specify chunking
>>                  for netcdf files. If you are not familiar with
>>             chunking, then
>>                  you probably can ignore this message.
>>
>>                  Currently, one specifies a per-dimension decomposition
>>                  that determines how the data for a variable is decomposed
>>                  into chunks. So e.g. if I have variable (pardon the
>>             shorthand notation)
>>                     x[d1=8,d2=12]
>>                  and I say d1 is chunked 4 and d2 is chunked 4, then x
>>             will be decomposed
>>                  into 6 chunks (8/4 * 12/4).
>>
>>                  I am proposing this alternate. Suppose we have
>>                       x[d1,d2,...dm]
>>                  And we specify a position 1<=c<m
>>                  Then the idea is that we create chunks of size
>>                      d(c+1) * d(c+2) *...dm
>>                  There will be d1*d2*...dc such chunks.
>>                  In other words, we split the set of dimensions at some
>>             point (c)
>>                  and create the chunks based on that split.
>>
>>                  The claim is that for many situations, the leftmost
>>             dimensions
>>                  are what we want to iterate over: e.g. time; and we
>>             then want
>>                  to read all of the rest of the data associated with
>>             that time.
>>
>>                  So, my question is: is such a style of chunking useful?
>>
>>                  If this is not clear, let me know and I will try to
>>             clarify.
>>                  =Dennis Heimbigner
>>                    Unidata
>>
>>
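For readers trying to pin down the proposal quoted above: no such interface
exists in netCDF today, but one reading of it, consistent with the
NC_UNLIMITED/30/50 exchange near the top of this thread, is a mapping along
these lines. The helper name and the interpretation of the split position
are entirely hypothetical:

    /* Hypothetical sketch only -- no such helper exists in netCDF.
     * c is taken as the 0-based index of the first dimension that keeps
     * its full extent; dimensions before it get chunk size 1. */
    #include <stdio.h>
    #include <stddef.h>

    static void split_point_chunksizes(int ndims, const size_t *dimlens,
                                       int c, size_t *chunksizes) {
        for (int i = 0; i < ndims; i++)
            chunksizes[i] = (i < c) ? 1 : dimlens[i];
    }

    int main(void) {
        /* Dimension lengths from the exchange above; the unlimited time
         * dimension is represented by a placeholder length, since it falls
         * before the split and gets chunk size 1 anyway. */
        size_t dimlens[3] = {1, 30, 50};
        size_t chunks[3];

        split_point_chunksizes(3, dimlens, 1, chunks);
        printf("chunks: %zu %zu %zu\n", chunks[0], chunks[1], chunks[2]);
        /* prints: chunks: 1 30 50 */
        return 0;
    }

The result of such a mapping could then be passed straight to the existing
nc_def_var_chunking().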