On Fri, Dec 27, 2013 at 5:15 PM, Dave Allured - NOAA Affiliate <dave.allured@xxxxxxxx> wrote:

> FWIW, here is a chunk size recipe that works well for some rather large
> gridded files that I work with.

Thanks Dave.

> ncdump -hst uwnd.1979-2012.nc
> ...
> level = 37 ;
> lat = 256 ;
> lon = 512 ;
> time = UNLIMITED ; // (49676 currently)

so one unlimited dimension.

> float time(time) ;
>     time:_Storage = "chunked" ;
>     time:_ChunkSizes = 16384 ;

This is the 1-d array -- similar to our case. How did you come up with the 16384 (2^14)? Is there a benefit to base-2 numbers here? I tend to do that, too, but I'm not sure why.

> float uwnd(time, level, lat, lon) ;
>     uwnd:_Storage = "chunked" ;
>     uwnd:_ChunkSizes = 1, 1, 256, 512 ;

So in this case, the unlimited dimension is chunked to 1. This makes clear what I suspected -- the default is set to 1, and users haven't noticed a problem, because it does, in fact, make sense for larger multi-dimensional arrays -- i.e. one time step per chunk, or, in this case, one time step and one level per chunk. But a chunk size of 1 for a 1-d (or small 2-d) variable is really, really bad!

> This scheme depends on good chunk caching with adequate buffers for both
> read and write.

Still not sure where to go there -- it seems optimal chunking depends on both your reading and writing patterns, AND hardware, so it is literally impossible to have a generic optimum, and difficult in some cases to have any idea at all. But my minimal tests have indicated that performance isn't all that sensitive to chunk sizes within a wide range.

> I think it is a good idea to design chunking on a per-variable basis, not
> per-dimension. Think of chunks as small hyperslabs, not dimension steps.

Exactly.

> Note in particular the successful use of two very different chunk numbers
> in two different variables on the unlimited time dimension.

Right -- that's a good point.

So, a proposal: the default chunking needs to take into account all the dimensions of the variable.
I'd propose something like: starting from the right-most dimension (fastest varying, yes?), have the chunk size be equal to the dimension if it's not unlimited. When you get to an unlimited dimension, have its chunk be whatever is needed to add up to a defined "decent sized" chunk -- I'd say maybe 1k, but someone other than me might have a better idea for a default. Any other unlimited dimensions would get a chunk size of 1. So:

For a 1-d unlimited variable, you'd get a chunk size of 1k (1024).

For a small 2-d variable, say (unlimited, 2), you'd get a chunk size of 2 for the second dimension, and 512 for the first.

For Dave's example above, you'd get just what he used for the 4-d variable.

You'd also need a max total chunk size to cap it off -- I think the lib has that already, though it looks like it applies the limits on a per-dimension basis, rather than a per-whole-chunk basis.

-Chris

> I do not have answers for your specific questions right now, hopefully
> someone else will respond.
>
> --Dave
>
> On Fri, Dec 27, 2013 at 2:15 PM, Chris Barker <chris.barker@xxxxxxxx> wrote:
>
>> Hi all,
>>
>> We're having some issues with unlimited dimensions and chunking. First, a
>> few notes:
>>
>> I'm using the netCDF4 python wrappers, and having different symptoms on
>> Windows and Mac, so this could be issues in the py wrappers, or the
>> netcdf lib, or the hdf lib, or how one of those is built...
>>
>> If I try to use an unlimited dimension and NOT specify any chunking, I get
>> odd results:
>>
>> On Windows: it takes many times longer to run, and produces a file that
>> is 6 times as big.
>>
>> On OS-X: the mac crashes if I try to use an unlimited dimension and not
>> specify chunking.
>>
>> This page:
>>
>> http://www.unidata.ucar.edu/software/netcdf/docs/default_chunking_4_0_1.html
>>
>> does indicate that the default is a chunksize of 1, which seems insanely
>> small to me, but should at least work.
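The proposed default could be sketched roughly like this. This is a minimal sketch of the heuristic described above, not library code: the 1024-element target comes from the "maybe 1k" suggestion, and the whole-chunk cap value is an assumption chosen so that Dave's 4-d example reproduces his `1, 1, 256, 512` layout.

```python
def propose_chunks(shape, unlimited_axes, target=1024, cap=256 * 512):
    """Sketch of the proposed default chunking: fixed dims (right to left)
    take the whole dimension until a whole-chunk cap is hit; the right-most
    unlimited dim fills up to `target` elements; any other unlimited dims
    get a chunk size of 1.  `shape` uses None for unlimited dims; all sizes
    are in elements, not bytes."""
    chunks = [1] * len(shape)
    elems = 1
    for ax in reversed(range(len(shape))):      # fixed dims, fastest-varying first
        if ax in unlimited_axes:
            continue
        take = shape[ax] if elems * shape[ax] <= cap else max(1, cap // elems)
        chunks[ax] = take
        elems *= take
    for ax in reversed(range(len(shape))):      # right-most unlimited dim
        if ax in unlimited_axes:
            chunks[ax] = max(1, target // elems)
            break
    return tuple(chunks)

# the three cases from the proposal:
print(propose_chunks((None,), {0}))               # 1-d unlimited -> (1024,)
print(propose_chunks((None, 2), {0}))             # (unlimited, 2) -> (512, 2)
print(propose_chunks((None, 37, 256, 512), {0}))  # Dave's uwnd -> (1, 1, 256, 512)
```

With the cap applied per whole chunk rather than per dimension, the level dimension in the 4-d case collapses to 1 automatically, which is exactly the layout Dave reported.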
>> Note: does setting a chunksize of 1 mean that HDF will really use chunks
>> that small? In perusing the HDF docs, it seems it needs to build up a
>> tree structure to store where all the chunks are, and there are
>> performance implications to a large tree -- so a chunksize of 1
>> guarantees a really big tree. Wouldn't a small, but far from 1, value
>> make some sense? Like 1k or something?
>>
>> In my experiments with a simple 1-d array with an unlimited dimension,
>> writing a MB at a time, dropping the chunksize below about 512 started
>> to affect write performance.
>>
>> Very small chunks really made it crawl.
>>
>> And explicitly setting size-1 chunks made it crash (on OS-X, with a
>> malloc error). So I think that explains my problem.
>>
>> With smaller data sets, it works, but runs really slowly -- with an 8MB
>> dataset, going from a chunksize of 1 to a chunksize of 128 reduced write
>> time from 10 seconds to 0.1 seconds.
>>
>> Increasing to 16k reduces it to about 0.03 seconds -- larger than that
>> makes no noticeable difference.
>>
>> So I think I know why I'm having problems with unspecified chunksizes,
>> and a chunksize of 1 probably shouldn't be the default!
>>
>> However, if you specify a chunksize, HDF does seem to allocate at least
>> one full chunk in the file -- which makes sense, so you wouldn't want to
>> store a very small variable with a large chunk size. But I suspect:
>>
>> 1) if you are using an unlimited dimension, you are unlikely to be
>> storing VERY small arrays.
>>
>> 2) netcdf4 seems to have about 8k of overhead anyway.
>>
>> So a 1k or so sized default seems reasonable.
>>
>> One last note: from experimenting, it appears that you set chunksizes in
>> numbers of elements rather than number of bytes. Is that the case? I
>> haven't been able to find it documented anywhere.
>>
>> Thanks,
>> -Chris
>>
>> --
>>
>> Christopher Barker, Ph.D.
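The tree-size concern above is easy to quantify: the number of chunks the HDF layer has to index is just the dataset size divided by the chunk size. A back-of-the-envelope sketch for the 8 MB test case described above (assuming 4-byte floats, which the message doesn't actually state):

```python
# Rough chunk counts for an 8 MB dataset, assuming 4-byte float elements.
n_elems = 8 * 1024 * 1024 // 4          # 2,097,152 elements

for chunk in (1, 128, 16384):
    n_chunks = -(-n_elems // chunk)     # ceiling division
    print(f"chunksize {chunk:>5}: {n_chunks:>7} chunks to index")
```

Roughly two million index entries for chunksize 1 versus 128 entries for chunksize 16k is consistent with the 10 s vs. 0.03 s write times reported in the experiments above.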
>> Oceanographer
>>
>> Emergency Response Division
>> NOAA/NOS/OR&R            (206) 526-6959   voice
>> 7600 Sand Point Way NE   (206) 526-6329   fax
>> Seattle, WA  98115       (206) 526-6317   main reception
>>
>> Chris.Barker@xxxxxxxx
>>
>> _______________________________________________
>> netcdfgroup mailing list
>> netcdfgroup@xxxxxxxxxxxxxxxx
>> For list information or to unsubscribe, visit:
>> http://www.unidata.ucar.edu/mailing_lists/

--

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris.Barker@xxxxxxxx