On Fri, Dec 27, 2013 at 5:15 PM, Dave Allured - NOAA Affiliate <dave.allured@xxxxxxxx> wrote:

> FWIW, here is a chunk size recipe that works well for some rather large
> gridded files that I work with.

Thanks Dave.

> ncdump -hst uwnd.1979-2012.nc
> ...
> level = 37 ;
> lat = 256 ;
> lon = 512 ;
> time = UNLIMITED ; // (49676 currently)

so one unlimited dimension.

> float time(time) ;
>     time:_Storage = "chunked" ;
>     time:_ChunkSizes = 16384 ;

This is the 1-d array -- similar to our case. How did you come up with the 16384 (2^14)? Is there a benefit to base-2 numbers here? I tend to do that, too, but I'm not sure why.

> float uwnd(time, level, lat, lon) ;
>     uwnd:_Storage = "chunked" ;
>     uwnd:_ChunkSizes = 1, 1, 256, 512 ;

So in this case, the unlimited dimension is chunked to 1. This makes clear what I suspected -- the default is set to 1, and users haven't noticed a problem, because it does, in fact, make sense for larger multi-dimensional arrays -- i.e. one time step per chunk, or, in this case, one time step and one level per chunk. But a chunk size of 1 for a 1-d (or small 2-d) variable is really, really bad!

> This scheme depends on good chunk caching with adequate buffers for both
> read and write.

Still not sure where to go there -- it seems optimal chunking depends on both your reading and writing patterns, AND hardware, so it is literally impossible to have a generic optimum, and difficult in some cases to have any idea at all. But my minimal tests have indicated that performance isn't all that sensitive to chunk sizes within a wide range.

> I think it is a good idea to design chunking on a per-variable basis, not
> per-dimension. Think of chunks as small hyperslabs, not dimension steps.

Exactly.

> Note in particular the successful use of two very different chunk numbers
> in two different variables on the unlimited time dimension.

Right -- that's a good point.

So, a proposal: the default chunking needs to take into account all the dimensions of the variable.
I'd propose something like: starting from the right-most dimension (fastest varying, yes?), have the chunk size be equal to the dimension if it's not unlimited. When you get to an unlimited dimension, have its chunk be whatever is needed to add up to a defined "decent sized" chunk -- I'd say maybe 1k, but someone other than me might have a better idea for a default. Any other unlimited dimensions would get a chunk size of 1. So:

For a 1-d unlimited variable, you'd get a chunk size of 1k (1024).

For a small 2-d variable, say (unlimited, 2), you'd get a chunk size of 2 for the second dimension, and 512 for the first.

For Dave's example above, you'd get just what he used for the 4-d variable.

You'd also need a max total chunk size to cap it off -- I think the lib has that already, though it looks like it applies the limits on a per-dimension basis, rather than a per-whole-chunk basis.

-Chris

> I do not have answers for your specific questions right now, hopefully
> someone else will respond.
>
> --Dave
>
> On Fri, Dec 27, 2013 at 2:15 PM, Chris Barker <chris.barker@xxxxxxxx> wrote:
>
>> Hi all,
>>
>> We're having some issues with unlimited dimensions and chunking. First, a
>> few notes:
>>
>> I'm using the netCDF4 python wrappers, and having different symptoms on
>> Windows and Mac, so this could be issues in the py wrappers, or the
>> netcdf lib, or the hdf lib, or how one of those is built...
>>
>> If I try to use an unlimited dimension and NOT specify any chunking, I get
>> odd results:
>>
>> On Windows: it takes many times longer to run, and produces a file that
>> is 6 times as big.
>>
>> On OS-X: the mac crashes if I try to use an unlimited dimension and not
>> specify chunking.
>>
>> This page:
>>
>> http://www.unidata.ucar.edu/software/netcdf/docs/default_chunking_4_0_1.html
>>
>> does indicate that the default is a chunksize of 1, which seems insanely
>> small to me, but should at least work.
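The proposed default could be sketched roughly like this. This is a minimal sketch of the heuristic described above, not library code: the 1024-element target comes from the "maybe 1k" suggestion, and the whole-chunk cap value is an assumption chosen so that Dave's 4-d example reproduces his `1, 1, 256, 512` layout.

```python
def propose_chunks(shape, unlimited_axes, target=1024, cap=256 * 512):
    """Sketch of the proposed default chunking: fixed dims (right to left)
    take the whole dimension until a whole-chunk cap is hit; the right-most
    unlimited dim fills up to `target` elements; any other unlimited dims
    get a chunk size of 1.  `shape` uses None for unlimited dims; all sizes
    are in elements, not bytes."""
    chunks = [1] * len(shape)
    elems = 1
    for ax in reversed(range(len(shape))):      # fixed dims, fastest-varying first
        if ax in unlimited_axes:
            continue
        take = shape[ax] if elems * shape[ax] <= cap else max(1, cap // elems)
        chunks[ax] = take
        elems *= take
    for ax in reversed(range(len(shape))):      # right-most unlimited dim
        if ax in unlimited_axes:
            chunks[ax] = max(1, target // elems)
            break
    return tuple(chunks)

# the three cases from the proposal:
print(propose_chunks((None,), {0}))               # 1-d unlimited -> (1024,)
print(propose_chunks((None, 2), {0}))             # (unlimited, 2) -> (512, 2)
print(propose_chunks((None, 37, 256, 512), {0}))  # Dave's uwnd -> (1, 1, 256, 512)
```

With the cap applied per whole chunk rather than per dimension, the level dimension in the 4-d case collapses to 1 automatically, which is exactly the layout Dave reported.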
>> Note: does setting a chunksize of 1 mean that HDF will really use chunks
>> that small? In perusing the HDF docs, it seems it needs to build up a
>> tree structure to store where all the chunks are, and there are
>> performance implications to a large tree -- so a chunksize of 1
>> guarantees a really big tree. Wouldn't a small, but far from 1, value
>> make some sense? Like 1k or something?
>>
>> In my experiments with a simple 1-d array with an unlimited dimension,
>> writing a MB at a time, dropping the chunksize below about 512 started
>> to affect write performance.
>>
>> Very small chunks really made it crawl.
>>
>> And explicitly setting size-1 chunks made it crash (on OS-X, with a
>> malloc error). So I think that explains my problem.
>>
>> With smaller data sets, it works, but runs really slowly -- with an 8MB
>> dataset, going from a chunksize of 1 to a chunksize of 128 reduced write
>> time from 10 seconds to 0.1 seconds.
>>
>> Increasing to 16k reduces it to about 0.03 seconds -- larger than that
>> makes no noticeable difference.
>>
>> So I think I know why I'm having problems with unspecified chunksizes,
>> and a chunksize of 1 probably shouldn't be the default!
>>
>> However, if you specify a chunksize, HDF does seem to allocate at least
>> one full chunk in the file -- which makes sense, so you wouldn't want to
>> store a very small variable with a large chunk size. But I suspect:
>>
>> 1) if you are using an unlimited dimension, you are unlikely to be
>> storing VERY small arrays.
>>
>> 2) netcdf4 seems to have about 8k of overhead anyway.
>>
>> So a 1k or so sized default seems reasonable.
>>
>> One last note: from experimenting, it appears that you set chunksizes in
>> numbers of elements rather than number of bytes. Is that the case? I
>> haven't been able to find it documented anywhere.
>>
>> Thanks,
>> -Chris
>>
>> --
>>
>> Christopher Barker, Ph.D.
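The tree-size concern above is easy to quantify: the number of chunks the HDF layer has to index is just the dataset size divided by the chunk size. A back-of-the-envelope sketch for the 8 MB test case described above (assuming 4-byte floats, which the message doesn't actually state):

```python
# Rough chunk counts for an 8 MB dataset, assuming 4-byte float elements.
n_elems = 8 * 1024 * 1024 // 4          # 2,097,152 elements

for chunk in (1, 128, 16384):
    n_chunks = -(-n_elems // chunk)     # ceiling division
    print(f"chunksize {chunk:>5}: {n_chunks:>7} chunks to index")
```

Roughly two million index entries for chunksize 1 versus 128 entries for chunksize 16k is consistent with the 10 s vs. 0.03 s write times reported in the experiments above.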
>> Oceanographer
>>
>> Emergency Response Division
>> NOAA/NOS/OR&R            (206) 526-6959   voice
>> 7600 Sand Point Way NE   (206) 526-6329   fax
>> Seattle, WA  98115       (206) 526-6317   main reception
>>
>> Chris.Barker@xxxxxxxx
>>
>> _______________________________________________
>> netcdfgroup mailing list
>> netcdfgroup@xxxxxxxxxxxxxxxx
>> For list information or to unsubscribe, visit:
>> http://www.unidata.ucar.edu/mailing_lists/

--

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris.Barker@xxxxxxxx