Re: HDF5 chunking questions...

Hi Ed,

> > >     Unfortunately there aren't generic instructions for this sort of
> > > thing, it's very application-I/O-pattern dependent.  A general heuristic
> > > is to pick lower and upper bounds on the size of a chunk (in bytes) and
> > > try to make the chunks "squarish" (in n-D).  One thing to keep in mind
> > > is that the default chunk cache in HDF5 is 1MB, so it's probably
> > > worthwhile to keep chunks under half of that.  A reasonable lower limit
> > > is a small multiple of the block size of a disk (usually 4KB).
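
[A minimal sketch of that heuristic in the HDF5 C API; the 256 x 256 chunk
shape and the helper name make_chunked_dcpl are illustrative assumptions,
not values from this thread:

    #include "hdf5.h"

    /* Sketch: a dataset creation property list with "squarish" chunks of
     * 256 x 256 floats = 256 KB each, i.e. above the ~4 KB disk-block
     * lower bound and under half of the default 1 MB chunk cache. */
    static hid_t make_chunked_dcpl(void)
    {
        hsize_t chunk_dims[2] = {256, 256};
        hid_t   dcpl = H5Pcreate(H5P_DATASET_CREATE);

        H5Pset_chunk(dcpl, 2, chunk_dims);
        return dcpl;   /* pass as the dataset creation property list */
    }
]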
> 
> 1 MB seems low for scientific applications. Even cheap consumer PCs come
> with about half a gig of RAM. Scientific machines much more
> so. Wouldn't it be helpful to have 100 MB, for example?
    Yes, we've kicked that around, we should bump it up to something more
reasonable in a future release.
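
[Until then, the cache can be enlarged per file; a sketch assuming a 100 MB
figure like the one Ed mentions (the slot count and preemption weight below
are just the usual defaults, and make_big_cache_fapl is a hypothetical name):

    #include "hdf5.h"

    /* Sketch: a file access property list with a 100 MB raw-data chunk
     * cache instead of the 1 MB default. */
    static hid_t make_big_cache_fapl(void)
    {
        hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);

        H5Pset_cache(fapl, 0,               /* mdc_nelmts (ignored in recent HDF5 releases) */
                     521,                   /* rdcc_nslots: hash table slots   */
                     100 * 1024 * 1024,     /* rdcc_nbytes: 100 MB chunk cache */
                     0.75);                 /* rdcc_w0: preemption weight      */
        return fapl;   /* pass to H5Fcreate()/H5Fopen() */
    }
]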

> > >     Generally, you are trying to avoid the situation below:
> > >
> > >         Dataset with 10 chunks (dimension sizes don't really matter):
> > >         +----+----+----+----+----+
> > >         |    |    |    |    |    |
> > >         |    |    |    |    |    |
> > >         | A  | B  | C  | D  | E  |
> > >         +----+----+----+----+----+
> > >         |    |    |    |    |    |
> > >         |    |    |    |    |    |
> > >         | F  | G  | H  | I  | J  |
> > >         +----+----+----+----+----+
> > >
> > >         If you are writing hyperslabs to part of each chunk like this:
> > >         (hyperslab 1 is in chunk A, hyperslab 2 is in chunk B, etc.)
> > >         +----+----+----+----+----+
> > >         |1111|2222|3333|4444|5555|
> > >         |6666|7777|8888|9999|0000|
> > >         | A  | B  | C  | D  | E  |
> > >         +----+----+----+----+----+
> > >         |    |    |    |    |    |
> > >         |    |    |    |    |    |
> > >         | F  | G  | H  | I  | J  |
> > >         +----+----+----+----+----+
> > >
> > >         If the chunk cache is only large enough to hold 4 chunks, then
> > >         chunk A will be preempted from the cache for chunk E (when
> > >         hyperslab 5 is written), but will immediately be re-loaded to
> > >         write hyperslab 6 out.
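
[Rough numbers for that scenario; the 256 KB chunk size is an illustrative
assumption, not a figure from the message:

    chunk size : 256 x 256 floats x 4 bytes = 256 KB
    cache size : 1 MB default -> holds 4 chunks at a time
    row A..E   : 5 chunks, so writing hyperslab 5 evicts chunk A,
                 which must be read back immediately for hyperslab 6.
]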
> 
> OK, great. Let me see if I can start to come up with the rules by
> which I can select chunk sizes:
> 
> 1 - Min chunk size should be 4 KB.
> 2 - Max chunk size should allow n chunks to fit in the chunk cache,
> where n is around the max number of chunks the user will access at
> once in a hyperslab.
    Generally, yes.
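
[A sketch of Ed's two rules above as a helper function; suggest_chunk_bytes
and its parameters are hypothetical names for illustration, not part of the
netCDF or HDF5 API:

    #include <stddef.h>

    #define MIN_CHUNK_BYTES (4 * 1024)              /* rule 1 */

    /* Hypothetical helper: largest chunk size (in bytes) such that the n
     * chunks touched by one hyperslab access all fit in the chunk cache
     * (rule 2), but never below 4 KB (rule 1).
     * Assumes n_chunks_accessed > 0. */
    static size_t
    suggest_chunk_bytes(size_t cache_bytes, size_t n_chunks_accessed)
    {
        size_t max_bytes = cache_bytes / n_chunks_accessed;

        return (max_bytes < MIN_CHUNK_BYTES) ? MIN_CHUNK_BYTES : max_bytes;
    }

    /* e.g. suggest_chunk_bytes(1024 * 1024, 4) == 256 * 1024 */
]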

> > >
> > >     Unfortunately, our general-purpose software can't predict the I/O
> > > pattern in which users will access the data, so it is a tough problem.
> > > On the one hand, you want to keep the chunks small enough that they
> > > will stick around in the cache until they are finished being
> > > written/read, but on the other you want the chunks to be larger so that
> > > the I/O on them is more efficient. :-/
> 
> I think we can make some reasonable guesses for netcdf-3.x access
> patterns, so that we can at least ensure the common tasks are working
> fast enough.
    Cool.

> Obviously any user can flummox our optimizations by doing some odd
> things we don't expect. As my old engineering professors told me: you
> can make it foolproof, but you can't make it damn-foolproof.
    :-)

        Quincey
