Dear netCDF developers and users,

I am writing to ask for advice on setting up efficient NetCDF-based parallel I/O in the model I work on (PISM, see [1]). This is not a question of *tuning* I/O in a program: I can replace *all* of PISM's I/O code if necessary [2]. So, the question is this: how do I use NetCDF to write the 2D and 3D fields described below efficiently and in parallel? (A sketch of the kind of write I have in mind is in the P.S. below.)

Here is an example of a setup I need to be able to handle: a 2640 (X dimension) by 4560 (Y dimension) uniform Cartesian grid [3] with 401 vertical (Z) levels. 2D fields take ~90 MB each; 3D fields take ~36 GB each.

A grid like this is typically distributed (uniformly) over 512 MPI processes, each process getting a ~70 MB portion of a 3D field and ~200 KB per 2D field.

During a typical model run PISM writes the full model state (one 3D field and a handful of 2D fields, ~38 GB total) several times (checkpointing plus the final output). In addition to this, the user may choose to write a number [4] of fields at regular intervals throughout the run. It is not unusual to write about 1000 records of each of these fields, appending to the output file.

Note that PISM's horizontal (2D) grid is split into rectangular patches, most likely 16 patches in one direction and 32 in the other in a 512-core run. (This means that what is contiguous in memory usually is not contiguous in a file, even when storage orders match.)

Notes:

- The records we are writing are too big for the NetCDF-3 file format, so we have to use NetCDF-4 or PnetCDF's CDF-5 format. I would prefer NetCDF-4, to simplify post-processing. (Before NetCDF 4.4.0 I would have said "PnetCDF is not an option because most post-processing tools don't support it." I'm happy to see CDF-5 support added to mainstream NetCDF-4. Most post-processing tools still don't support it, though.)

- If possible, output files should have one unlimited dimension (time). NetCDF variables should have "time" as the first dimension, so that PISM's output fits the NetCDF "classic" data model. (We try to follow the Climate and Forecast (CF) conventions.)

- During post-processing and analysis, variables are accessed one record at a time, but each record can be stored contiguously. (I have no idea how to pick chunking parameters, though; the P.S. includes one guess.)

- All the systems I have access to use Lustre.

Thanks in advance for any input you might have!

[1]: PISM stands for "Parallel Ice Sheet Model". See www.pism-docs.org for details.
[2]: I hope you don't need the details of our existing implementation to see what is going on. I'm happy to provide them if necessary, though.
[3]: This grid covers all of Greenland at a spatial resolution of 600 m.
[4]: About 50 in a typical run; these are usually 2D fields.

-- Constantine
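P.S. To make the question concrete, here is a minimal sketch of the kind of write I have in mind: one time record of a 2D field, written collectively by 512 ranks into a NetCDF-4 file with an unlimited "time" dimension. The file name, the variable name "thk", and the exact decomposition are illustrative assumptions, not PISM's actual code, and error checking is omitted. It assumes netCDF built against parallel HDF5; compile with mpicc and link with -lnetcdf.

#include <mpi.h>
#include <netcdf.h>
#include <netcdf_par.h>
#include <stdlib.h>

#define NX 2640  /* global X size (from the message above) */
#define NY 4560  /* global Y size */
#define PX 16    /* patches in X */
#define PY 32    /* patches in Y; PX*PY = 512 ranks */

/* Block-distribute n points over `parts` ranks, handling remainders
   (NY is not divisible by 32, so a plain n/parts split would drop rows). */
static void split(size_t n, size_t parts, size_t p,
                  size_t *start, size_t *count) {
  size_t base = n / parts, rem = n % parts;
  *count = base + (p < rem ? 1 : 0);
  *start = p * base + (p < rem ? p : rem);
}

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* run with exactly 512 ranks */

  size_t x0, nx, y0, ny;
  split(NX, PX, rank % PX, &x0, &nx);
  split(NY, PY, rank / PX, &y0, &ny);

  int ncid, dimids[3], varid;
  /* NC_MPIIO selects MPI-I/O parallel access (implied by default in
     recent netCDF releases) */
  nc_create_par("pism-sketch.nc", NC_NETCDF4 | NC_MPIIO,
                MPI_COMM_WORLD, MPI_INFO_NULL, &ncid);

  nc_def_dim(ncid, "time", NC_UNLIMITED, &dimids[0]);
  nc_def_dim(ncid, "y", NY, &dimids[1]);
  nc_def_dim(ncid, "x", NX, &dimids[2]);
  nc_def_var(ncid, "thk", NC_DOUBLE, 3, dimids, &varid);

  /* one chunk per time record (~90 MB here, under HDF5's 4 GiB chunk
     limit), so a one-record read touches a single chunk */
  size_t chunks[3] = {1, NY, NX};
  nc_def_var_chunking(ncid, varid, NC_CHUNKED, chunks);
  nc_enddef(ncid);

  /* writes along the unlimited dimension must be collective */
  nc_var_par_access(ncid, varid, NC_COLLECTIVE);

  double *patch = malloc(nx * ny * sizeof(double));
  for (size_t i = 0; i < nx * ny; i++)
    patch[i] = rank;                      /* placeholder data */

  size_t start[3] = {0, y0, x0};          /* record 0, this rank's corner */
  size_t count[3] = {1, ny, nx};
  nc_put_vara_double(ncid, varid, start, count, patch);

  free(patch);
  nc_close(ncid);
  MPI_Finalize();
  return 0;
}

As I understand it, the collective access mode is required for writes along the unlimited dimension in parallel NetCDF-4, which is part of why I ask about the time-first layout.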
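P.P.S. On chunking specifically, here is the one heuristic I have considered; it is a guess, not something I have validated on Lustre. For 2D fields, one whole record per chunk seems natural (a ~90 MB record is well under HDF5's 4 GiB per-chunk limit). A ~36 GB 3D record cannot be a single chunk, though, so this sketch keeps the full vertical column together and grows a square y-x tile until a chunk reaches a target size:

#include <stddef.h>

/* Pick chunk sizes for a (time, z, y, x) variable: one time record per
   chunk in time, the full z column, and a square y-x tile grown (by
   doubling) until a chunk reaches roughly target_bytes.  The target
   size and the doubling rule are my assumptions, not netCDF advice. */
static void pick_chunks(size_t nz, size_t ny, size_t nx,
                        size_t elem_size, size_t target_bytes,
                        size_t chunks[4]) {
  size_t tile = 1;
  while (2 * tile <= ny && 2 * tile <= nx &&
         (2 * tile) * (2 * tile) * nz * elem_size <= target_bytes)
    tile *= 2;
  chunks[0] = 1;    /* one record per chunk along "time" */
  chunks[1] = nz;   /* keep each vertical column in one chunk */
  chunks[2] = tile;
  chunks[3] = tile;
}

With this grid, double-precision data, and an 8 MiB target this yields 1 x 401 x 32 x 32 chunks of about 3 MiB each. Whether that is a sensible target chunk size on Lustre is exactly the kind of thing I would appreciate advice on.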