Russ:
Thanks, this was interesting. I think you want to change the table
heading in Table 1 from "NCAR IBM P690" to "NCSA IBM P690".
Thanks for the correction. We will fix it.
I wonder if the results that show less wall clock time for 6 time
steps than for 4 time steps and similarly for 10 time steps less than
for 8 time steps with pnetCDF on the NCSA P690 might be an indication
of a discretization error in the timing. Or maybe something else was
consuming enough of the machine that the results are unreliable.
I am not sure whether a discretization error in the timing is the reason.
It is possible that the machine was busy during some runs. We show this
figure to demonstrate that Parallel NetCDF performs worse than sequential
NetCDF for small writes. However, factors such as the parallel file system,
the type of platform, the number of processors, the file layout of the model
output, and the MPI-IO and GPFS implementations also affect the performance.
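One way to tell timer granularity apart from machine load would be to repeat
each run several times and compare the spread of the wall clock times with
the resolution of the clock. The following is only a minimal Python sketch of
that idea, not what was done for the report. The function dummy_run is a
hypothetical stand-in for launching one model run, and the repeat count is an
arbitrary choice.

```python
import statistics
import time


def time_repeated(func, repeats=5):
    """Return the wall-clock durations (seconds) of repeated calls to func."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Return the minimum, the median, and the max-min spread of the durations."""
    return {
        "min": min(durations),
        "median": statistics.median(durations),
        "spread": max(durations) - min(durations),
    }


def timer_resolution():
    """Resolution (seconds) of the clock used above, as reported by Python."""
    return time.get_clock_info("perf_counter").resolution


def dummy_run():
    # Hypothetical stand-in for one model run; replace with the real job launch.
    sum(range(10000))


if __name__ == "__main__":
    print("clock resolution:", timer_resolution())
    print(summarize(time_repeated(dummy_run)))
```

If the spread between repeated runs is much larger than the clock resolution,
load on the machine is the more likely explanation for a shorter time at 6
steps than at 4 steps.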
I'm also curious why the pnetCDF appears to be so much slower than
serial netCDF for small writes. Do you know what the nature of the
MPI-IO overhead is that could explain what appears to be a 10:1
slowdown for using pnetCDF with 4 time steps on the NCSA P690? I
could understand maybe a 2:1 slowdown, but 10:1 seems surprisingly
large ...
Thanks for pointing this out. In fact, we may add more content to explain
this.
I can think of the following factors that may affect the performance: the
MPI-IO library, the Parallel NetCDF implementation, the parallel file
system, the type of platform, the number of processors, the file layout of
the model, and the domain decomposition of the model. We will write another
report solely on the performance of ROMS with Parallel NetCDF. In that
report we may discuss these factors in more detail.
One important reason I can think of:
As the paper mentions, there are about 20 1-element NetCDF variables
inside ROMS.
All of these variables are written in independent I/O mode. There are no
corresponding collective I/O Parallel NetCDF functions. One strength of
Parallel NetCDF is collective I/O with a good "set file view". So writing
one element into the NetCDF file through independent I/O does not use any
of the optimizations in Parallel NetCDF. That will, I think, tremendously
degrade the performance.
We may do another study to investigate whether the performance improves
when we stop writing those variables into the NetCDF file.
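To illustrate why many 1-element writes can be so expensive, here is a rough
latency-plus-bandwidth cost model in Python. It does not model MPI-IO or
Parallel NetCDF themselves. The per-call overhead and the bandwidth are
made-up illustrative values, not measurements from the NCSA P690, and the
function name write_time is hypothetical.

```python
def write_time(n_calls, bytes_per_call, overhead_s, bandwidth_bps):
    """Estimated time for n_calls writes: fixed per-call overhead plus transfer time."""
    return n_calls * (overhead_s + bytes_per_call / bandwidth_bps)


# Illustrative assumptions only; NOT measured on any real system.
OVERHEAD_S = 1e-3       # assumed fixed cost of one small write call (seconds)
BANDWIDTH_BPS = 100e6   # assumed sustained bandwidth (bytes per second)
N_SCALARS = 20          # about 20 1-element variables, as mentioned in the paper
ELEM_BYTES = 8          # one double-precision value

# Twenty separate 1-element writes versus one write carrying the same data.
separate = write_time(N_SCALARS, ELEM_BYTES, OVERHEAD_S, BANDWIDTH_BPS)
combined = write_time(1, N_SCALARS * ELEM_BYTES, OVERHEAD_S, BANDWIDTH_BPS)

if __name__ == "__main__":
    print("separate writes:", separate, "s")
    print("single write:   ", combined, "s")
    print("ratio:          ", separate / combined)
```

Under these assumptions the cost is almost entirely per-call overhead, so the
time grows with the number of calls and not with the amount of data. The
cost is paid again at every output time step.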
Kent
--Russ