Re: [netcdfgroup] Alternate chunking specification

Hi Dennis,

I agree with you that your proposed slicing strategy is what we most
often use. Since our input data is usually weather data and therefore
GRIB-related, the 'GRIB strategy' with c = m-1, i.e. 2-D arrays, is our
default.
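
As a rough illustration (the file, dimension names and sizes below are
made up, not taken from any real setup), that default corresponds to
something like the following with the current per-dimension API:

    /* Sketch of the "GRIB strategy" (c = m-1): one full 2-D field per
     * chunk for a (time, y, x) variable.  Names and sizes are
     * illustrative only; error checking is omitted for brevity. */
    #include <netcdf.h>

    int main(void) {
        int ncid, dimids[3], varid;
        size_t ny = 949, nx = 1069;     /* hypothetical grid size */
        size_t chunks[3];

        nc_create("fields.nc", NC_NETCDF4 | NC_CLOBBER, &ncid);
        nc_def_dim(ncid, "time", NC_UNLIMITED, &dimids[0]);
        nc_def_dim(ncid, "y", ny, &dimids[1]);
        nc_def_dim(ncid, "x", nx, &dimids[2]);
        nc_def_var(ncid, "air_temperature", NC_FLOAT, 3, dimids, &varid);

        chunks[0] = 1;    /* iterate over time...            */
        chunks[1] = ny;   /* ...but keep each 2-D field      */
        chunks[2] = nx;   /* ...together in a single chunk   */
        nc_def_var_chunking(ncid, varid, NC_CHUNKED, chunks);

        nc_close(ncid);
        return 0;
    }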

We have a few exceptions to this strategy; both are sketched in code below.
a) Global high-resolution datasets: tiling strategy.
Since most of our reads are only interested in our region, we chunk the
world into a few (4x3 or 5x2) tiles, which usually gives us 4x faster
I/O on gzipped chunks.

b) Timeseries strategy (not operational yet, just testing).
For serving point timeseries of weather data to the public, we rechunk
the files to 2x2 or 4x4 tiles in the x/y direction and make the time
chunk as large as possible.
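
In code, and again only as a sketch for a hypothetical (time, y, x)
variable (the tile counts and time-chunk length are indicative, not our
exact settings), the two exceptions look roughly like this:

    /* Chunk shapes for exceptions a) and b); ncid/varid would come
     * from a file defined as in the previous sketch. */
    #include <netcdf.h>
    #include <stddef.h>

    /* a) tiling: ~4x3 tiles over the globe, one time step per chunk
     *    (keeping the time chunk at 1 is an assumption here). */
    static void set_tiling_chunks(int ncid, int varid, size_t ny, size_t nx)
    {
        size_t chunks[3] = {1, (ny + 2) / 3, (nx + 3) / 4};
        nc_def_var_chunking(ncid, varid, NC_CHUNKED, chunks);
    }

    /* b) timeseries: small 4x4 spatial tiles, time chunk as large as
     *    practical (ntime here is a hypothetical number of steps). */
    static void set_timeseries_chunks(int ncid, int varid, size_t ntime)
    {
        size_t chunks[3] = {ntime, 4, 4};
        nc_def_var_chunking(ncid, varid, NC_CHUNKED, chunks);
    }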

In most cases, we don't chunk per variable but per dimension.

About the usefulness of your approach: it is not as flexible as the old
approach, so points a) and b) above aren't covered. It would be a nice
simplification if one could easily set a chunking strategy as in
netcdf-java, e.g. GRIB_CHUNK_STRATEGY or
"COMPLETE_RIGHT_DIMENSIONS_CHUNK_STRATEGY, 2". I prefer to set c from
the right rather than the left, since I often have (time,y,x),
(time,z,y,x) and (time,ensemble,z,y,x) variables in the same file, and
it is the rightmost part that is the same.
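
To make that concrete, here is a hypothetical helper (neither the
function nor the strategy name exists in the netCDF-C API; it only
illustrates what such a right-counting convenience would compute): the
rightmost c dimensions are kept whole and everything to their left gets
chunk size 1.

    #include <stddef.h>

    /* Fill 'chunks' (length ndims) so that the rightmost c dimensions
     * are complete and all others have chunk size 1.  Purely a sketch
     * of a "complete rightmost dimensions" strategy. */
    static void right_split_chunks(size_t ndims, const size_t *dimlens,
                                   size_t c, size_t *chunks)
    {
        for (size_t i = 0; i < ndims; i++)
            chunks[i] = (i >= ndims - c) ? dimlens[i] : 1;
    }

With c = 2 this yields {1, ny, nx} for (time,y,x), {1, 1, ny, nx} for
(time,z,y,x) and {1, 1, 1, ny, nx} for (time,ensemble,z,y,x), so a
single setting covers all three variable shapes in the same file.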

Best regards,

Heiko


On 2017-05-15 21:29, dmh@xxxxxxxx wrote:
> I am soliciting opinions about an alternate way to specify chunking
> for netcdf files. If you are not familiar with chunking, then
> you probably can ignore this message.
> 
> Currently, one specifies a per-dimension decomposition that
> determines how the data for a variable is decomposed
> into chunks. So, e.g., if I have a variable (pardon the shorthand notation)
>   x[d1=8,d2=12]
> and I say d1 is chunked 4 and d2 is chunked 4, then x will be decomposed
> into 6 chunks (8/4 * 12/4).
> 
> I am proposing this alternative. Suppose we have
>     x[d1,d2,...dm]
> And we specify a position 1<=c<m
> Then the idea is that we create chunks of size
>    d(c+1) * d(c+2) *...dm
> There will be d1*d2*...dc such chunks.
> In other words, we split the set of dimensions at some point (c)
> and create the chunks based on that split.
> 
> The claim is that in many situations the leftmost dimensions
> are what we want to iterate over (e.g. time), and we then want
> to read all of the rest of the data associated with that time.
> 
> So, my question is: is such a style of chunking useful?
> 
> If this is not clear, let me know and I will try to clarify.
> =Dennis Heimbigner
>  Unidata

-- 
Dr. Heiko Klein                   Norwegian Meteorological Institute
Tel. + 47 22 96 32 58             P.O. Box 43 Blindern
http://www.met.no                 0313 Oslo NORWAY


