- To: Ed Hartnett <ed@xxxxxxxxxxxxxxxx>
- Subject: Re: [netcdfgroup] NetCDF4 Parallel-enabled IO performance issue
- From: Wei Huang <huangwei@xxxxxxxx>
- Date: Tue, 20 Sep 2011 11:24:07 -0600
Ed,
See my answer/comments to your email below.
Thanks,
Wei Huang
huangwei@xxxxxxxx
VETS/CISL
National Center for Atmospheric Research
P.O. Box 3000 (1850 Table Mesa Dr.)
Boulder, CO 80307-3000 USA
(303) 497-8924
On Sep 19, 2011, at 4:43 PM, Ed Hartnett wrote:
> Wei Huang <huangwei@xxxxxxxx> writes:
>
>> Hi, netcdfgroup,
>>
>> Currently, we are trying to use parallel-enabled NetCDF4. We started with
>> read/write a 5G file and some computation, we got the following timing (in
>> wall-clock) on a IBM power machine:
>> Number of Processors   Total(seconds)   Read(seconds)   Write(seconds)   Computation(seconds)
>> seq                            89.137          28.206          48.327                 11.717
>> 1                             178.953          44.837         121.17                  11.644
>> 2                             167.25           46.571         113.343                  5.648
>> 4                             168.138          44.043         118.968                  2.729
>> 8                             137.74           25.161         108.986                  1.064
>> 16                            113.354          16.359          93.253                  0.494
>> 32                            439.481         122.201         311.215                  0.274
>> 64                            831.896         277.363         588.653                  0.203
>>
>> The first thing we can see is that when we run the parallel-enabled code on a
>> single processor, the total wall-clock time doubles.
>> We also did not see any scaling as more processors were added.
>>
>> Anyone wants to share their experience?
>>
>> Thanks,
>>
>> Wei Huang
>> huangwei@xxxxxxxx
>> VETS/CISL
>> National Center for Atmospheric Research
>> P.O. Box 3000 (1850 Table Mesa Dr.)
>> Boulder, CO 80307-3000 USA
>> (303) 497-8924
>>
>>
>
> Howdy Wei and all!
>
> Are you using the 4.1.2 release? Did you configure with
> --enable-parallel-tests, and did those tests pass?
I am using 4.1.3, configured with "--enable-parallel-tests", and those tests
passed.
>
> I would suggest building netCDF with --enable-parallel-tests and then
> running nc_test4/tst_nc4perf. This simple program, based on
> user-contributed test code, performs parallel I/O with a wide variety of
> options, and prints a table of results.
I have run nc_test4/tst_nc4perf for 1, 2, 4, and 8 processors; results are attached.
To me, the performance decreases as the number of processors increases.
Someone may have a better interpretation.
I also ran tst_parallel4, with these results:
num_proc time(s) write_rate(B/s)
1 9.2015 1.16692e+08
2 12.4557 8.62048e+07
4 6.30644 1.70261e+08
8 5.53761 1.939e+08
16 2.25639 4.75866e+08
32 2.28383 4.7015e+08
64 2.19041 4.90202e+08
>
> This will tell you whether parallel I/O is working on your platform, and
> at least give some idea of reasonable settings.
>
> Parallel I/O is a very complex topic. However, if everything is working
> well, you should see I/O improvement which scales reasonably linearly,
> for less than about 8 processors (perhaps more, depending on your
> system, but not much more.) At this point, your parallel application is
> saturating your I/O subsystem, and further I/O performance is
> marginal.
>
> In general, HDF5 I/O will not be faster than netCDF-4 I/O. The netCDF-4
> layer is very light in this area, and simply makes the HDF5 calls that the
> user would make anyway.
>
> Key settings are:
>
> * MPI_IO vs. POSIX_IO (varies from platform to platform which is
> faster. See nc4perf results for your machine/compiler.)
We tested both; POSIX is better on our system.
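For reference, here is roughly how we switch between the two modes when creating
the file. Apart from the netCDF calls themselves, the names here (the path and
the use_posix flag) are placeholders, not our actual code:

#include <mpi.h>
#include <netcdf.h>
#include <netcdf_par.h>   /* parallel create/open API */

/* Create a netCDF-4 file for parallel access; the cmode flag picks the
 * I/O driver (MPI-IO vs. MPI-POSIX).  Placeholder sketch, not production code. */
int create_parallel_file(const char *path, int use_posix, int *ncidp)
{
    int cmode = NC_NETCDF4 | (use_posix ? NC_MPIPOSIX : NC_MPIIO);
    return nc_create_par(path, cmode, MPI_COMM_WORLD, MPI_INFO_NULL, ncidp);
}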
>
> * Chunking and caching play a big role, as always. Caching is
> turned off by default; otherwise, the netCDF caches on all the processors
> would consume too much memory. But you should set the cache to at least the
> size of one chunk. Note that this cache will be allocated on every processor
> involved.
>
We use chunking; we can probably try caching as well.
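To make sure we try what you suggest, something like the sketch below is what we
would add. The chunk shape and cache sizes are made-up placeholders, not our
real values:

#include <netcdf.h>

/* Define a chunked 2-D float variable and give it a per-variable chunk cache
 * at least as large as one chunk.  All sizes here are placeholders; note that
 * the cache is allocated on every MPI rank that opens the file. */
int define_chunked_var(int ncid, const int dimids[2], int *varidp)
{
    size_t chunks[2] = {64, 1024};                   /* placeholder chunk shape */
    size_t cache_bytes = 64 * 1024 * sizeof(float);  /* >= one chunk */
    int status;

    if ((status = nc_def_var(ncid, "data", NC_FLOAT, 2, dimids, varidp)))
        return status;
    if ((status = nc_def_var_chunking(ncid, *varidp, NC_CHUNKED, chunks)))
        return status;
    /* arguments: bytes, number of chunk slots, preemption (0.0 - 1.0) */
    return nc_set_var_chunk_cache(ncid, *varidp, cache_bytes, 10, 0.75);
}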
> * Collective vs. independent access. Seems (to my naive view) like
> independent should usually be faster, but the opposite seems to be
> the case. This is because the I/O subsystems are good at grouping I/O
> requests into larger, more efficient units. Collective access gives
> the I/O layer the maximum chance to exercise its magic.
We tried both; there was no significant difference.
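For the record, switching between the two took only one call per variable,
roughly the pattern below (just a sketch, nothing here is our actual code):

#include <netcdf.h>
#include <netcdf_par.h>   /* nc_var_par_access(), NC_COLLECTIVE, NC_INDEPENDENT */

/* Choose collective access (the MPI-IO layer may aggregate requests) or
 * independent access for one variable. */
int set_par_access(int ncid, int varid, int collective)
{
    return nc_var_par_access(ncid, varid,
                             collective ? NC_COLLECTIVE : NC_INDEPENDENT);
}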
>
> Best thing to do is get tst_nc4perf working on your platform, and then
> modify it to write data files that match yours (i.e. same size
> variables). The program will then tell you the best set of settings to
> use in your case.
We can modify this program to mimic our data sizes, but we do not know whether
that will help us.
>
> If the program shows that parallel I/O is not working, take a look at
> the netCDF test program h5_test/tst_h_par.c. This is a HDF5-only program
> (no netcdf code at all) that does parallel I/O. If this program does not
> show that parallel I/O is working, then your problem is not with the
> netCDF layer, but somewhere in HDF5 or even lower in the stack.
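For anyone following along, the bare-HDF5 path that such a test exercises looks
roughly like the sketch below; the dataset name and sizes are placeholders, and
this is not the actual tst_h_par code:

#include <stdlib.h>
#include <mpi.h>
#include <hdf5.h>

/* Minimal HDF5-only parallel write: each MPI rank writes one row of a
 * 2-D integer dataset using collective transfers.  Names are placeholders. */
int hdf5_parallel_write(int rank, int nprocs, int ncols)
{
    hsize_t dims[2]  = {(hsize_t)nprocs, (hsize_t)ncols};
    hsize_t start[2] = {(hsize_t)rank, 0};
    hsize_t count[2] = {1, (hsize_t)ncols};
    int *buf = malloc(ncols * sizeof(int));
    for (int i = 0; i < ncols; i++) buf[i] = rank;

    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
    hid_t file = H5Fcreate("test.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    hid_t filespace = H5Screate_simple(2, dims, NULL);
    hid_t dset = H5Dcreate2(file, "data", H5T_NATIVE_INT, filespace,
                            H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    hid_t memspace = H5Screate_simple(2, count, NULL);
    H5Sselect_hyperslab(filespace, H5S_SELECT_SET, start, NULL, count, NULL);

    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);
    herr_t status = H5Dwrite(dset, H5T_NATIVE_INT, memspace, filespace, dxpl, buf);

    H5Pclose(dxpl); H5Sclose(memspace); H5Dclose(dset);
    H5Sclose(filespace); H5Fclose(file); H5Pclose(fapl);
    free(buf);
    return (status < 0) ? -1 : 0;
}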
>
> Thanks!
>
> Ed
>
> --
> Ed Hartnett -- ed@xxxxxxxxxxxxxxxx
Attachment:
mpi_io_bluefire.perf
Description: Binary data