Re: [netcdfgroup] Alternate chunking specification

To: netcdfgroup@xxxxxxxxxxxxxxxx
Subject: Re: [netcdfgroup] Alternate chunking specification
From: Dave Allured - NOAA Affiliate <dave.allured@xxxxxxxx>
Date: Tue, 16 May 2017 16:31:22 -0600
Okay, it sounds like you are NOT proposing any changes to the netcdf-4 file
format, or to the existing API functions.  Good.

You just asked for use cases DIFFERENT THAN 1,1,...1,di,dj,...dm.  Here is
one.

My local agency's data portal has gridded data sets that are normally
dimensioned (time, lat, lon) or (time, level, lat, lon).  These are chunked
1,d2,d3 or 1,1,d3,d4 for normal access, which is very popular.  These use
cases could be served by your proposed alternate chunk spec method.
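
To make that concrete, here is a minimal Python sketch of how a split position c would map onto the chunk shapes above. The function names and the dimension sizes are made up for illustration; nothing here is an existing or proposed netCDF API.

```python
from math import prod

def split_chunk_shape(dims, c):
    """Chunk shape for a split at position c, where 1 <= c < len(dims).

    The first c dimensions get chunk length 1; the remaining
    dimensions are kept at full size.
    """
    if not 1 <= c < len(dims):
        raise ValueError("c must satisfy 1 <= c < m")
    return (1,) * c + tuple(dims[c:])

def split_chunk_count(dims, c):
    """Number of chunks produced by the split: d1 * d2 * ... * dc."""
    return prod(dims[:c])

# Hypothetical sizes for (time, lat, lon) and (time, level, lat, lon).
print(split_chunk_shape((365, 73, 144), 1))      # the 1,d2,d3 case
print(split_chunk_shape((365, 17, 73, 144), 2))  # the 1,1,d3,d4 case
```

Under this reading, c=1 reproduces the 1,d2,d3 layout and c=2 reproduces 1,1,d3,d4.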

However, some of our on-line applications serve long time series for single
grid points.  The normal chunking schemes like 1,d2,d3 prove to be
unacceptably slow for grid point time series.  "Compromise" chunking
schemes were tested, and they did not seem to perform well enough.

So we created mirror data sets which are chunked optimally for reading
single grid points, e.g. (d1,1,1).  These perform very well in live
operation, and we think that the double storage is worthwhile.
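
A rough way to see the difference is to count how many chunks a single read has to touch under each layout. The sketch below is only an illustration: the dimension sizes are invented, the helper name is hypothetical, and it assumes reads that start on a chunk boundary. It counts chunks, not seconds.

```python
from math import ceil, prod

def chunks_touched(read_shape, chunk_shape):
    """Chunks overlapped by a read that starts on a chunk boundary."""
    return prod(ceil(r / c) for r, c in zip(read_shape, chunk_shape))

# Hypothetical (time, lat, lon) variable.
dims = (1000, 73, 144)

map_layout = (1, dims[1], dims[2])   # 1,d2,d3: one chunk per time step
series_layout = (dims[0], 1, 1)      # d1,1,1: one chunk per grid point

point_series = (dims[0], 1, 1)       # full time series at one grid point
single_map = (1, dims[1], dims[2])   # one full map at one time step

for name, layout in (("1,d2,d3", map_layout), ("d1,1,1", series_layout)):
    print(name,
          "series read:", chunks_touched(point_series, layout),
          "map read:", chunks_touched(single_map, layout))
```

Each layout needs one chunk for the read it was designed for and one chunk per time step or per grid point for the other, which is why a mirror copy per access pattern can be worth the extra storage.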

This is almost the same use case as Heiko Klein's second one, "b)
timeseries strategy".

--Dave A.
NOAA/OAR/ESRL/PSD/CIRES
Boulder, Colorado


On Tue, May 16, 2017 at 1:56 PM, dmh@xxxxxxxx <dmh@xxxxxxxx> wrote:

> Note that I am proposing a second way to specify chunking on a variable.
> I am not proposing to remove any existing functionality.
>
> But let me restate my question.
> Question: what are some good use cases for having a chunking spec
> that is different than
>     1,1,...1,di,dj,...dm
> where di is the full size of the ith dimension of the variable?
> Heiko Klein has given a couple of good use cases, and I am looking for
> more.
> =Dennis
>
>
> On 5/16/2017 1:30 PM, Dave Allured - NOAA Affiliate wrote:
>
>> Dennis,
>>
>> Are you saying that the original function nc_def_var_chunking will be
>> kept intact, and there will be a new function that will simplify chunk
>> setting for some data scenarios?  You are not proposing any changes in the
>> netcdf-4 file format?
>>
>> --Dave
>>
>>
>> On Mon, May 15, 2017 at 1:29 PM, dmh@xxxxxxxx <dmh@xxxxxxxx> wrote:
>>
>>     I am soliciting opinions about an alternate way to specify chunking
>>     for netcdf files. If you are not familiar with chunking, then
>>     you probably can ignore this message.
>>
>>     Currently, one specifies per-dimension chunk sizes that
>>     together determine how the data for a variable is decomposed
>>     into chunks. So e.g. if I have a variable (pardon the shorthand
>>     notation)
>>        x[d1=8,d2=12]
>>     and I say d1 is chunked 4 and d2 is chunked 4, then x will be
>>     decomposed into 6 chunks (8/4 * 12/4).
>>
>>     I am proposing this alternate. Suppose we have
>>          x[d1,d2,...dm]
>>     And we specify a position 1<=c<m
>>     Then the idea is that we create chunks of size
>>         d(c+1) * d(c+2) *...dm
>>     There will be d1*d2*...dc such chunks.
>>     In other words, we split the set of dimensions at some point (c)
>>     and create the chunks based on that split.
>>
>>     The claim is that for many situations, the leftmost dimensions
>>     are what we want to iterate over: e.g. time; and we then want
>>     to read all of the rest of the data associated with that time.
>>
>>     So, my question is: is such a style of chunking useful?
>>
>>     If this is not clear, let me know and I will try to clarify.
>>     =Dennis Heimbigner
>>       Unidata
>>
>
Follow-Ups:
- Re: [netcdfgroup] Alternate chunking specification
  - From: Ed Hartnett