On 04/17/2014 03:22 AM, Alexis Praga wrote:
Hi, I have some questions about parallel netCDF4 (using HDF5, not PnetCDF). I think it's best to just ask them, so please excuse the long list: 1) What is its strategy for parallel I/O?
I'm not entirely sure what you're asking here. Most parallel I/O libraries carry out I/O to different regions of the file simultaneously (in parallel), and thereby extract more aggregate performance out of the storage system.
For any application using any I/O library, the trickiest part is deciding how to decompose your domain over N parallel processes and how to describe that decomposition.
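To make that concrete, here is a minimal sketch of the netCDF-4/HDF5 parallel write path under a 1-D block decomposition. The file name, variable name, and sizes are invented for illustration, and it assumes a netCDF built against a parallel HDF5: each rank describes its piece of the domain as a start/count pair and writes only that slab.

/* Minimal parallel netCDF-4 write: every rank opens the same file and
 * writes its own slab of one shared variable. Error checking omitted
 * for brevity; assumes NX_GLOBAL is divisible by the number of ranks. */
#include <mpi.h>
#include <netcdf.h>
#include <netcdf_par.h>
#include <stdlib.h>

#define NX_GLOBAL 1024

int main(int argc, char **argv)
{
    int rank, nprocs, ncid, dimid, varid;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* All ranks create the file collectively through the communicator. */
    nc_create_par("example.nc", NC_NETCDF4 | NC_MPIIO,
                  MPI_COMM_WORLD, MPI_INFO_NULL, &ncid);

    nc_def_dim(ncid, "x", NX_GLOBAL, &dimid);
    nc_def_var(ncid, "data", NC_DOUBLE, 1, &dimid, &varid);
    nc_enddef(ncid);

    /* Collective access usually performs best at scale. */
    nc_var_par_access(ncid, varid, NC_COLLECTIVE);

    /* 1-D block decomposition: rank i owns elements [start, start+count). */
    size_t count = NX_GLOBAL / nprocs;
    size_t start = rank * count;
    double *buf = malloc(count * sizeof(double));
    for (size_t i = 0; i < count; i++)
        buf[i] = (double)(start + i);

    nc_put_vara_double(ncid, varid, &start, &count, buf);

    nc_close(ncid);
    free(buf);
    MPI_Finalize();
    return 0;
}

The same start/count idea generalizes to 2-D and 3-D decompositions; only the bookkeeping gets harder.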
2) How is it related to HDF5? Is it just a wrapper around it?
In one way of looking at it, yes. In order to adopt HDF5 as one possible backend, though, the Unidata netCDF folks designed a dispatch system, so one might write via the classic netCDF interface, via the Argonne-Northwestern Parallel-NetCDF interface, via HDF5, or via DAP.
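A rough sketch of what that dispatch looks like from the application side. The exact mode-flag spellings vary across netCDF versions, so take these as illustrative rather than authoritative:

/* The create-mode flags steer the call through different dispatch
 * layers; flag names here follow the netCDF 4.x headers of this era. */
#include <mpi.h>
#include <netcdf.h>
#include <netcdf_par.h>

void open_examples(MPI_Comm comm)
{
    int ncid;

    /* NC_NETCDF4 routes the call through the HDF5 dispatch layer. */
    nc_create_par("via_hdf5.nc", NC_NETCDF4 | NC_MPIIO,
                  comm, MPI_INFO_NULL, &ncid);
    nc_close(ncid);

    /* NC_PNETCDF routes a classic-format file through the
     * Argonne-Northwestern Parallel-NetCDF dispatch layer. */
    nc_create_par("via_pnetcdf.nc", NC_PNETCDF,
                  comm, MPI_INFO_NULL, &ncid);
    nc_close(ncid);
}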
3) When writing a netCDF4 file, is it really netCDF or is it HDF5? ncdump -k returns "netCDF4", but I am not sure.
The new file format is an HDF5 file that can be examined with the broad ecosystem of HDF5 utilities. This HDF5 file, though, has a particular schema or layout that marks it as a netCDF-4 kind of HDF5 file.
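A small sketch of checking this from the netCDF side (the file name is illustrative): nc_inq_format reports which schema the library recognized in the file it opened.

#include <stdio.h>
#include <netcdf.h>

int main(void)
{
    int ncid, format;
    /* Open an existing file read-only and ask what kind it is. */
    nc_open("example.nc", NC_NOWRITE, &ncid);
    nc_inq_format(ncid, &format);
    if (format == NC_FORMAT_NETCDF4)
        printf("HDF5 file carrying the netCDF-4 schema\n");
    nc_close(ncid);
    return 0;
}

The same file will also answer as an HDF5 file to h5dump and friends; the two views are not in conflict.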
4) Is there some documentation online? I only found this: http://www.unidata.ucar.edu/software/netcdf/docs/netcdf-tutorial/Parallel.html which is very light. 5) Any references (papers or benchmarks) are welcome. At the moment, I have only found the paper by Li et al. (2003) about PnetCDF.
In strict performance terms (which in the end is not really the be-all and end-all), Argonne-Northwestern Parallel-NetCDF will be hard to beat, unless you are working with record variables. The classic netCDF (CDF-1, CDF-2, and CDF-5) file formats are incredibly friendly to parallel I/O, but this friendly layout comes at a cost: a file can have only one UNLIMITED dimension, and the layout of record variables is sub-optimal for I/O.
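For illustration, here is how a record variable gets defined (names invented): the classic formats allow exactly one NC_UNLIMITED dimension, and it must come first, which is exactly the restriction above.

#include <netcdf.h>

void define_record_var(int ncid)
{
    int time_dim, x_dim, varid;
    int dims[2];

    /* The classic formats permit exactly one NC_UNLIMITED dimension. */
    nc_def_dim(ncid, "time", NC_UNLIMITED, &time_dim);
    nc_def_dim(ncid, "x", 1024, &x_dim);

    dims[0] = time_dim;   /* the record dimension must come first */
    dims[1] = x_dim;
    nc_def_var(ncid, "temperature", NC_DOUBLE, 2, dims, &varid);
}

On disk, the classic formats interleave all record variables record by record, which is what makes their access patterns sub-optimal for large contiguous parallel reads and writes.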
HDF5's file format allows for greater flexibility, but that flexibility comes at a metadata cost. Once you start operating on large enough datasets at large enough levels of parallelism, the underlying file system becomes the limit on performance.
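One example of that flexibility, as a sketch (sizes and names invented): the HDF5-backed format lets you chunk a variable so that each rank's slab maps onto whole chunks, but every chunk is bookkeeping the library has to track, which is part of the metadata cost.

#include <netcdf.h>

void chunk_for_ranks(int ncid, int varid, size_t nx_global, int nprocs)
{
    /* One chunk per rank's slab along the decomposed dimension
     * (assumes a 1-D variable and nx_global divisible by nprocs). */
    size_t chunks[1] = { nx_global / nprocs };
    nc_def_var_chunking(ncid, varid, NC_CHUNKED, chunks);
}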
==rob

--
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA