Re: [netcdfgroup] NetCDF4 Parallel-enabled IO performance issue

  • To: Ed Hartnett <ed@xxxxxxxxxxxxxxxx>
  • Subject: Re: [netcdfgroup] NetCDF4 Parallel-enabled IO performance issue
  • From: Wei Huang <huangwei@xxxxxxxx>
  • Date: Tue, 20 Sep 2011 11:24:07 -0600
Ed,

See my answer/comments to your email below.

Thanks,

Wei Huang
huangwei@xxxxxxxx
VETS/CISL
National Center for Atmospheric Research
P.O. Box 3000 (1850 Table Mesa Dr.)
Boulder, CO 80307-3000 USA
(303) 497-8924





On Sep 19, 2011, at 4:43 PM, Ed Hartnett wrote:

> Wei Huang <huangwei@xxxxxxxx> writes:
> 
>> Hi, netcdfgroup,
>> 
>> Currently, we are trying to use parallel-enabled NetCDF4. We started by
>> reading/writing a 5 GB file plus some computation, and we got the following
>> wall-clock timings on an IBM Power machine:
>> Processors    Total (s)    Read (s)    Write (s)    Computation (s)
>> seq            89.137       28.206      48.327       11.717
>> 1             178.953       44.837     121.17        11.644
>> 2             167.25        46.571     113.343        5.648
>> 4             168.138       44.043     118.968        2.729
>> 8             137.74        25.161     108.986        1.064
>> 16            113.354       16.359      93.253        0.494
>> 32            439.481      122.201     311.215        0.274
>> 64            831.896      277.363     588.653        0.203
>> 
>> The first thing we can see is that when the parallel-enabled code is run on
>> one processor, the total wall-clock time doubles compared to the sequential run.
>> We also do not see any scaling as more processors are added.
>> 
>> Does anyone want to share their experience?
>> 
>> Thanks,
>> 
>> Wei Huang
>> huangwei@xxxxxxxx
>> VETS/CISL
>> National Center for Atmospheric Research
>> P.O. Box 3000 (1850 Table Mesa Dr.)
>> Boulder, CO 80307-3000 USA
>> (303) 497-8924
>> 
>> 
> 
> Howdy Wei and all!
> 
> Are you using the 4.1.2 release? Did you configure with
> --enable-parallel-tests, and did those tests pass?

I am using 4.1.3, configured with "--enable-parallel-tests", and those tests
passed.
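
For anyone following along, a rough minimal sketch of the kind of parallel
write involved (the file name, sizes, and variable name below are only
illustrative, not from our application; depending on the netCDF version the
parallel declarations live in netcdf_par.h):

#include <stdlib.h>
#include <mpi.h>
#include <netcdf.h>
#include <netcdf_par.h>

#define NX 1024
#define NY 1024

int main(int argc, char **argv)
{
    int rank, nprocs, ncid, dimids[2], varid;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Create a netCDF-4/HDF5 file for parallel access with the MPI-IO
     * driver.  Error checking is omitted for brevity. */
    nc_create_par("example.nc", NC_NETCDF4 | NC_MPIIO,
                  MPI_COMM_WORLD, MPI_INFO_NULL, &ncid);

    nc_def_dim(ncid, "x", NX, &dimids[0]);
    nc_def_dim(ncid, "y", NY, &dimids[1]);
    nc_def_var(ncid, "data", NC_FLOAT, 2, dimids, &varid);
    nc_enddef(ncid);

    /* Each rank writes its own contiguous block of rows (assumes nprocs
     * divides NX evenly). */
    size_t rows = NX / nprocs;
    size_t start[2] = {rank * rows, 0};
    size_t count[2] = {rows, NY};
    float *buf = malloc(rows * NY * sizeof(float));
    for (size_t i = 0; i < rows * NY; i++)
        buf[i] = (float)rank;

    nc_put_vara_float(ncid, varid, start, count, buf);

    nc_close(ncid);
    free(buf);
    MPI_Finalize();
    return 0;
}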

> 
> I would suggest building netCDF with --enable-parallel-tests and then
> running nc_test4/tst_nc4perf. This simple program, based on
> user-contributed test code, performs parallel I/O with a wide variety of
> options, and prints a table of results.

I have run nc_test4/tst_nc4perf for 1, 2, 4, and 8 processors; the results are
attached.

To me, the performance decreases as the number of processors increases.
Someone may have a better interpretation.

I also ran tst_parallel4, with these results:
num_proc   time(s)  write_rate(B/s)
1       9.2015  1.16692e+08
2       12.4557 8.62048e+07
4       6.30644 1.70261e+08
8       5.53761 1.939e+08
16      2.25639 4.75866e+08
32      2.28383 4.7015e+08
64      2.19041 4.90202e+08

> 
> This will tell you whether parallel I/O is working on your platform, and
> at least give some idea of reasonable settings.
> 
> Parallel I/O is a very complex topic. However, if everything is working
> well, you should see I/O improvement which scales reasonably linearly,
> for less than about 8 processors (perhaps more, depending on your
> system, but not much more.) At this point, your parallel application is
> saturating your I/O subsystem, and further I/O performance is
> marginal.
> 
> In general, HDF5 I/O will not be faster than netCDF-4 I/O. The netCDF-4
> layer is very light in this area, and simply calls the HDF5 that the
> user would call anyway.
> 
> Key settings are: 
> 
> * MPI_IO vs. POSIX_IO (varies from platform to platform which is
>  faster. See nc4perf results for your machine/compiler.)

    We tested both; POSIX was better.
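
    For reference, the driver is selected by the create-mode flag, roughly as
    in this sketch (NC_MPIPOSIX is the MPI-POSIX driver flag available in the
    4.1.x releases):

#include <mpi.h>
#include <netcdf.h>
#include <netcdf_par.h>

/* Sketch: pick the underlying I/O driver via the netCDF create-mode flag.
 * NC_MPIIO selects the MPI-IO driver, NC_MPIPOSIX the MPI-POSIX driver. */
int create_with_driver(const char *path, int use_posix, int *ncidp)
{
    int cmode = NC_NETCDF4 | (use_posix ? NC_MPIPOSIX : NC_MPIIO);
    return nc_create_par(path, cmode, MPI_COMM_WORLD, MPI_INFO_NULL, ncidp);
}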
> 
> * Chunking and caching play a big role, as always. Caching is
>  turned off by default, otherwise netCDF caches on all the processors
>  will consume too much memory. But you should set this to at least the
>  size of one chunk. Note that this cache will happen on all processors
>  involved.
> 

   We use chunking; we can probably try caching as well.
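
   If we do try caching, presumably it would look something like this sketch
   (the chunk sizes and cache parameters below are arbitrary placeholders,
   not tuned values):

#include <netcdf.h>

/* Sketch: set explicit chunk sizes for a 2-D variable (must be done in
 * define mode), then give that variable a chunk cache large enough to
 * hold at least one chunk. */
int tune_chunking_and_cache(int ncid, int varid)
{
    int status;
    size_t chunks[2] = {256, 1024};   /* placeholder chunk shape */

    if ((status = nc_def_var_chunking(ncid, varid, NC_CHUNKED, chunks)))
        return status;

    /* Cache size in bytes, number of slots, preemption (0.0 - 1.0). */
    return nc_set_var_chunk_cache(ncid, varid, 32 * 1024 * 1024, 1009, 0.75f);
}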

> * Collective vs. independent access. Seems (to my naive view) like
>  independent should usually be faster, but the opposite seems to be
>  the case. This is because the I/O subsystems are good at grouping I/O
>  requests into larger, more efficient units. Collective access gives
>  the I/O layer the maximum chance to exercise its magic.

    We tried both; there was no significant difference.
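
    For completeness, switching between the two modes is a single call per
    variable, roughly as in this sketch (NC_INDEPENDENT is the default
    access mode):

#include <netcdf.h>
#include <netcdf_par.h>

/* Sketch: choose collective or independent parallel access for one
 * variable before the subsequent put/get calls. */
int set_par_access(int ncid, int varid, int collective)
{
    return nc_var_par_access(ncid, varid,
                             collective ? NC_COLLECTIVE : NC_INDEPENDENT);
}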

> 
> Best thing to do is get tst_nc4perf working on your platform, and then
> modify it to write data files that match yours (i.e. same size
> variables). The program will then tell you the best set of settings to
> use in your case.

   We can modify this program to mimic our data size, but we do not know
whether this will help us.

> 
> If the program shows that parallel I/O is not working, take a look at
> the netCDF test program h5_test/tst_h_par.c. This is a HDF5-only program
> (no netcdf code at all) that does parallel I/O. If this program does not
> show that parallel I/O is working, then your problem is not with the
> netCDF layer, but somewhere in HDF5 or even lower in the stack.
> 
> Thanks!
> 
> Ed
> 
> -- 
> Ed Hartnett  -- ed@xxxxxxxxxxxxxxxx

Attachment: mpi_io_bluefire.perf
Description: Binary data
