Re: [netcdfgroup] Alternate chunking specification

To: Dave Allured - NOAA Affiliate <dave.allured@xxxxxxxx>
Subject: Re: [netcdfgroup] Alternate chunking specification
From: Ed Hartnett <edwardjameshartnett@xxxxxxxxx>
Date: Mon, 22 May 2017 09:08:36 -0600
Howdy Dennis and All!

I applaud the effort to improve user-specification of chunking. It is a
topic that causes confusion to many users, in my experience.

However, I'm not sure I understand your algorithm.

If I have dimids 0, 1, 2, which have dimlens NC_UNLIMITED, 30, and 50, and
I indicate that I want c = 1 (0-based index, as God and K&R intended), then
would I get chunksizes 1, 30, 50?
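
To make the question concrete, here is a minimal sketch of the split as I read the proposal. The function name split_chunksizes and its n_leading argument (the number of leading dimensions given chunk length 1) are hypothetical, and how the proposed c maps onto n_leading under 0-based or 1-based indexing is exactly the open point. Rejecting an unlimited dimension in the trailing part is also my assumption, not something the proposal states.

```python
NC_UNLIMITED = 0  # netcdf.h defines NC_UNLIMITED as 0L


def split_chunksizes(dimlens, n_leading):
    """Return chunk sizes for a split-point spec (hypothetical helper).

    The first n_leading dimensions get chunk length 1; every remaining
    dimension gets its full length.
    """
    if not 0 < n_leading < len(dimlens):
        raise ValueError("split point must leave dimensions on both sides")
    trailing = list(dimlens[n_leading:])
    if NC_UNLIMITED in trailing:
        # Assumption: an unlimited dimension has no fixed "full size".
        raise ValueError("unlimited dimension cannot be chunked at full length")
    return [1] * n_leading + trailing


# One leading dimension: the reading behind "1, 30, 50".
print(split_chunksizes([NC_UNLIMITED, 30, 50], 1))
# Two leading dimensions: the other plausible reading of c = 1 (0-based).
print(split_chunksizes([NC_UNLIMITED, 30, 50], 2))
```

With one leading dimension this gives 1, 30, 50; with two it gives 1, 1, 50, so the answer depends on which dimension c is taken to name.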

Thanks,
Ed Hartnett



On Tue, May 16, 2017 at 4:31 PM, Dave Allured - NOAA Affiliate <
dave.allured@xxxxxxxx> wrote:

> Okay, it sounds like you are NOT proposing any changes to the netcdf-4
> file format, or to the existing API functions.  Good.
>
> You just asked for use cases DIFFERENT THAN 1,1,...1,di,dj,...dm.  Here is
> one.
>
> My local agency's data portal has gridded data sets that are normally
> dimensioned (time, lat, lon) or (time, level, lat, lon).  These are chunked
> 1,d2,d3 or 1,1,d3,d4 for normal access, which is very popular.  These use
> cases could be served by your proposed alternate chunk spec method.
>
> However, some of our on-line applications serve long time series for
> single grid points.  The normal chunking schemes like 1,d2,d3 prove to be
> unacceptably slow for grid point time series.  "Compromise" chunking
> schemes were tested, and they did not seem to perform well enough.
>
> So we created mirror data sets which are chunked optimally for reading
> single grid points, e.g. (d1,1,1).  These perform very well in live
> operation, and we think that the double storage is worthwhile.
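
As a rough illustration of why the two layouts behave so differently, the sketch below counts how many chunks a hyperslab read has to touch. The grid shape (1000, 30, 50) and the helper name chunks_touched are hypothetical, and the count ignores caching and compression, so it only indicates the trend.

```python
def chunks_touched(chunksizes, start, count):
    """Count the chunks intersected by a hyperslab (hypothetical helper).

    start and count follow the usual netCDF hyperslab convention:
    one starting index and one edge length per dimension.
    """
    total = 1
    for c, s, k in zip(chunksizes, start, count):
        first = s // c            # index of first chunk along this dim
        last = (s + k - 1) // c   # index of last chunk along this dim
        total *= last - first + 1
    return total


# Hypothetical (time, lat, lon) variable with 1000 time steps.
series = ((0, 10, 20), (1000, 1, 1))   # full time series at one grid point
one_map = ((5, 0, 0), (1, 30, 50))     # one complete time step

for layout in [(1, 30, 50), (1000, 1, 1)]:
    print(layout,
          "series:", chunks_touched(layout, *series),
          "map:", chunks_touched(layout, *one_map))
```

Under these assumptions the map-oriented layout touches one chunk per time step for a grid point series, while the series-oriented layout touches a single chunk for the same read and pays instead when a whole map is requested.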
>
> This is almost the same use case as Heiko Klein's second one, "b)
> timeseries strategy".
>
> --Dave A.
> NOAA/OAR/ESRL/PSD/CIRES
> Boulder, Colorado
>
>
> On Tue, May 16, 2017 at 1:56 PM, dmh@xxxxxxxx <dmh@xxxxxxxx> wrote:
>
>> Note that I am proposing a second way to specify chunking on a variable.
>> I am not proposing to remove any existing functionality.
>>
>> But let me restate my question.
>> Question: what are some good use cases for having a chunking spec
>> that is different than
>>     1,1,...1,di,dj,...dm
>> where di is the full size of the ith dimension of the variable?
>> Heiko Klein has given a couple of good use cases, and I am looking for
>> more.
>> =Dennis
>>
>>
>> On 5/16/2017 1:30 PM, Dave Allured - NOAA Affiliate wrote:
>>
>>> Dennis,
>>>
>>> Are you saying that the original function nc_def_var_chunking will be
>>> kept intact, and there will be a new function that will simplify chunk
>>> setting for some data scenarios?  You are not proposing any changes in the
>>> netcdf-4 file format?
>>>
>>> --Dave
>>>
>>>
>>> On Mon, May 15, 2017 at 1:29 PM, dmh@xxxxxxxx wrote:
>>>
>>>     I am soliciting opinions about an alternate way to specify chunking
>>>     for netcdf files. If you are not familiar with chunking, then
>>>     you probably can ignore this message.
>>>
>>>     Currently, one specifies a per-dimension decomposition; together
>>>     these determine how the data for a variable is decomposed
>>>     into chunks. So e.g. if I have variable (pardon the shorthand
>>>     notation)
>>>        x[d1=8,d2=12]
>>>     and I say d1 is chunked 4 and d2 is chunked 4, then x will be
>>>     decomposed into 6 chunks (8/4 * 12/4).
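
The arithmetic in that example can be checked with a short sketch. The helper name chunk_count is hypothetical, and rounding up for dimensions that are not an exact multiple of the chunk length is my assumption about how partial edge chunks should be counted.

```python
import math


def chunk_count(dimlens, chunksizes):
    """Number of chunks for a per-dimension chunk spec (hypothetical helper)."""
    # -(-d // c) is integer ceiling division, so a partial chunk at the
    # edge of a dimension still counts as one chunk.
    per_dim = [-(-d // c) for d, c in zip(dimlens, chunksizes)]
    return math.prod(per_dim)


# x[d1=8,d2=12] chunked 4 by 4: (8/4) * (12/4) chunks.
print(chunk_count([8, 12], [4, 4]))
```

For x[d1=8,d2=12] with 4 by 4 chunks this reproduces the 6 chunks quoted in the message.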
>>>
>>>     I am proposing this alternative. Suppose we have
>>>          x[d1,d2,...dm]
>>>     and we specify a position c, with 1<=c<m.
>>>     Then the idea is that we create chunks of size
>>>         d(c+1) * d(c+2) * ... * dm
>>>     There will be d1*d2*...*dc such chunks.
>>>     In other words, we split the set of dimensions at some point (c)
>>>     and create the chunks based on that split.
>>>
>>>     The claim is that for many situations, the leftmost dimensions
>>>     are what we want to iterate over: e.g. time; and we then want
>>>     to read all of the rest of the data associated with that time.
>>>
>>>     So, my question is: is such a style of chunking useful?
>>>
>>>     If this is not clear, let me know and I will try to clarify.
>>>     =Dennis Heimbigner
>>>       Unidata
>>>
>>
Follow-Ups:
- Re: [netcdfgroup] Alternate chunking specification
  - From: Roy Mendelssohn - NOAA Federal
References:
- Re: [netcdfgroup] Alternate chunking specification
  - From: Dave Allured - NOAA Affiliate