Hi all,

We're having some issues with unlimited dimensions and chunking. First, a few notes: I'm using the netCDF4 python wrappers, and seeing different symptoms on Windows and Mac, so this could be an issue in the py wrappers, or the netcdf lib, or the hdf lib, or how one of those is built...

If I try to use an unlimited dimension and NOT specify any chunking, I get odd results:

On Windows: it takes many times longer to run, and produces a file that is 6 times as big.

On OS-X: it crashes if I try to use an unlimited dimension and don't specify chunking.

This page:

http://www.unidata.ucar.edu/software/netcdf/docs/default_chunking_4_0_1.html

does indicate that the default is a chunksize of 1, which seems insanely small to me, but should at least work.

Note: does setting a chunksize of 1 mean that HDF will really use chunks that small? Perusing the HDF docs, it seems it needs to build up a tree structure to store where all the chunks are, and there are performance implications to a large tree -- a chunksize of 1 guarantees a really big tree. Wouldn't a small, but far-from-1 value make more sense? Like 1k or so?

In my experiments with a simple 1-d array with an unlimited dimension, writing a MB at a time, dropping the chunksize below about 512 started to affect write performance. Very small chunks really made it crawl, and explicitly setting size-1 chunks made it crash (on OS-X, with a malloc error). So I think that explains my problem. With smaller data sets it works, but runs really slowly -- with an 8MB dataset, going from a chunksize of 1 to a chunksize of 128 reduced write time from 10 seconds to 0.1 seconds. Increasing to 16k reduces it to about 0.03 seconds -- larger than that makes no noticeable difference.

So I think I know why I'm getting problems with unspecified chunksizes, and a chunksize of 1 probably shouldn't be the default! However, if you specify a chunksize, HDF does seem to allocate at least one full chunk in the file -- which makes sense, so you wouldn't want to store a very small variable with a large chunk size. But I suspect:

1) if you are using an unlimited dimension, you are unlikely to be storing VERY small arrays.

2) netcdf4 seems to have about 8k of overhead anyway.

So a 1k-or-so default seems reasonable.

One last note: from experimenting, it appears that you set chunksizes in numbers of elements rather than number of bytes. Is that the case? I haven't been able to find it documented anywhere.

Thanks,
-Chris

--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris.Barker@xxxxxxxx
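
A minimal sketch of the kind of timing experiment described in the message, using the netCDF4 python wrappers: a 1-d variable on an unlimited dimension, written 1 MB at a time, with an explicit chunksize passed to createVariable. The file name, variable names, and the particular chunk sizes tried here are illustrative, not taken from the original script; the chunksizes argument appears to be counted in elements, as the message suggests.

    import time
    import numpy as np
    from netCDF4 import Dataset

    BLOCK = 128 * 1024          # 128k float64 values = 1 MB per write
    N_WRITES = 8                # 8 MB total, written 1 MB at a time

    for chunksize in (1, 128, 16 * 1024):
        nc = Dataset("chunk_test.nc", "w", format="NETCDF4")
        nc.createDimension("x", None)                  # unlimited dimension
        var = nc.createVariable("data", "f8", ("x",),
                                chunksizes=(chunksize,))  # in elements (apparently)
        data = np.arange(BLOCK, dtype="f8")

        start = time.time()
        for i in range(N_WRITES):
            # assigning past the current size extends the unlimited dimension
            var[i * BLOCK:(i + 1) * BLOCK] = data
        elapsed = time.time() - start
        nc.close()

        print("chunksize %6d: %.3f s" % (chunksize, elapsed))

Leaving out the chunksizes argument reproduces the default-chunking behaviour being asked about.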