Re: HDF5 chunking questions...

NOTE: The netcdf-hdf mailing list is no longer active. The list archives are made available for historical reasons.

Hi Ed,

> Quincey et al.,
> 
> Given an n-dimensional dataspace, with only one unlimited
> (i.e. extendable) dimension, tell me how to select the chunk size for
> each dimension to get a good read performance for large data files.
> 
> Would you care to suggest any smart algorithms to yield better
> performance for various situations?
    Unfortunately there aren't generic instructions for this sort of thing;
it's very dependent on the application's I/O pattern.  A general heuristic is
to pick lower and upper bounds on the size of a chunk (in bytes) and try to
make the chunks "squarish" (in n-D).  One thing to keep in mind is that the
default chunk cache in HDF5 is 1MB, so it's probably worthwhile to keep
chunks under half of that.  A reasonable lower limit is a small multiple of
the disk block size (usually 4KB).
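    That heuristic can be sketched in a few lines of C.  This is an
illustrative helper, not part of the HDF5 API: it grows a cubic chunk until
it clears the lower byte bound, then halves the largest dimension until the
chunk fits under the upper bound (e.g. half the 1MB default cache).

```c
#include <assert.h>
#include <stddef.h>

/* Pick a "squarish" chunk shape for an n-D dataset so that the chunk's
 * total size in bytes lands between lo_bytes and hi_bytes.  A sketch of
 * the heuristic described above; the name and signature are made up. */
static void pick_chunk_shape(int ndims, size_t elem_size,
                             size_t lo_bytes, size_t hi_bytes,
                             size_t *chunk /* out: ndims entries */)
{
    size_t side = 1, bytes;

    /* Grow the cube edge until the chunk exceeds the lower bound. */
    do {
        side *= 2;
        bytes = elem_size;
        for (int d = 0; d < ndims; d++)
            bytes *= side;
    } while (bytes < lo_bytes);

    for (int d = 0; d < ndims; d++)
        chunk[d] = side;

    /* Halve the largest dimension until we fit under the upper bound. */
    while (bytes > hi_bytes) {
        int big = 0;
        for (int d = 1; d < ndims; d++)
            if (chunk[d] > chunk[big])
                big = d;
        chunk[big] /= 2;
        bytes = elem_size;
        for (int d = 0; d < ndims; d++)
            bytes *= chunk[d];
    }
}
```

    For a 3-D dataset of 4-byte floats with bounds of 4KB and 512KB, this
picks a 16 x 16 x 16 chunk (16KB), comfortably inside both bounds.  The
resulting shape would then be passed to H5Pset_chunk on the dataset
creation property list.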
    Generally, you are trying to avoid the situation below:

        Dataset with 10 chunks (dimension sizes don't really matter):
        +----+----+----+----+----+
        |    |    |    |    |    |
        |    |    |    |    |    |
        | A  | B  | C  | D  | E  |
        +----+----+----+----+----+
        |    |    |    |    |    |
        |    |    |    |    |    |
        | F  | G  | H  | I  | J  |
        +----+----+----+----+----+

        If you are writing hyperslabs to part of each chunk like this:
        (hyperslab 1 is in chunk A, hyperslab 2 is in chunk B, etc.)
        +----+----+----+----+----+
        |1111|2222|3333|4444|5555|
        |6666|7777|8888|9999|0000|
        | A  | B  | C  | D  | E  |
        +----+----+----+----+----+
        |    |    |    |    |    |
        |    |    |    |    |    |
        | F  | G  | H  | I  | J  |
        +----+----+----+----+----+

        If the chunk cache is only large enough to hold 4 chunks, then chunk
        A will be preempted from the cache for chunk E (when hyperslab 5 is
        written), but will immediately be re-loaded to write hyperslab 6 out.
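    The cost of that preemption pattern is easy to see with a toy LRU cache
model.  This is not HDF5's actual cache implementation, just a simulation of
the scenario above: each letter in the pattern is a chunk access, and every
miss counts as a (re)load from disk.

```c
#include <assert.h>

/* Simulate an LRU chunk cache and count how many times chunks must be
 * (re)loaded for a given access pattern.  cache[0] is the most recently
 * used slot; a miss evicts the least recently used chunk when full. */
static int count_chunk_loads(const char *pattern, int cache_slots)
{
    char cache[16];
    int used = 0, loads = 0;

    for (const char *p = pattern; *p; p++) {
        int hit = -1;
        for (int i = 0; i < used; i++)
            if (cache[i] == *p) { hit = i; break; }
        if (hit < 0) {
            loads++;                    /* miss: chunk read from disk */
            if (used < cache_slots)
                used++;
            hit = used - 1;             /* reuse (evict) the LRU slot */
        }
        /* Move the accessed chunk to the front (most recently used). */
        for (int i = hit; i > 0; i--)
            cache[i] = cache[i - 1];
        cache[0] = *p;
    }
    return loads;
}
```

    Writing hyperslabs 1-10 touches chunks A-E twice, i.e. the pattern
"ABCDEABCDE".  With a 4-slot cache every one of the 10 accesses misses,
because each chunk is evicted just before it is needed again; with 5 slots
only the first 5 accesses miss.  That is why sizing chunks so a full "row"
of them fits in the cache matters so much.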

    Unfortunately, our general-purpose software can't predict the I/O pattern
in which users will access the data, so it is a tough problem.  On the one
hand, you want to keep the chunks small enough that they will stay in the
cache until they are finished being written/read; on the other, you want the
chunks to be large enough that the I/O on them is efficient. :-/

    Quincey
