Re: [netcdfgroup] NetCDF4 Parallel-enabled IO performance issue

  • To: Ed Hartnett <ed@xxxxxxxxxxxxxxxx>
  • Subject: Re: [netcdfgroup] NetCDF4 Parallel-enabled IO performance issue
  • From: Wei Huang <huangwei@xxxxxxxx>
  • Date: Tue, 20 Sep 2011 11:24:07 -0600
Ed,

See my answer/comments to your email below.

Thanks,

Wei Huang
huangwei@xxxxxxxx
VETS/CISL
National Center for Atmospheric Research
P.O. Box 3000 (1850 Table Mesa Dr.)
Boulder, CO 80307-3000 USA
(303) 497-8924





On Sep 19, 2011, at 4:43 PM, Ed Hartnett wrote:

> Wei Huang <huangwei@xxxxxxxx> writes:
> 
>> Hi, netcdfgroup,
>> 
>> Currently, we are trying to use parallel-enabled NetCDF4. We started by
>> reading/writing a 5 GB file plus some computation, and we got the following
>> wall-clock timings on an IBM Power machine:
>> Processors    Total (s)    Read (s)    Write (s)    Computation (s)
>> seq            89.137       28.206      48.327       11.717
>> 1             178.953       44.837     121.17        11.644
>> 2             167.25        46.571     113.343        5.648
>> 4             168.138       44.043     118.968        2.729
>> 8             137.74        25.161     108.986        1.064
>> 16            113.354       16.359      93.253        0.494
>> 32            439.481      122.201     311.215        0.274
>> 64            831.896      277.363     588.653        0.203
>> 
>> The first thing we can see is that when the parallel-enabled code is run on
>> one processor, the total wall-clock time doubles compared to the sequential run.
>> We also do not see any scaling as more processors are added.
>> 
>> Does anyone want to share their experience?
>> 
>> Thanks,
>> 
>> Wei Huang
>> huangwei@xxxxxxxx
>> VETS/CISL
>> National Center for Atmospheric Research
>> P.O. Box 3000 (1850 Table Mesa Dr.)
>> Boulder, CO 80307-3000 USA
>> (303) 497-8924
>> 
>> 
> 
> Howdy Wei and all!
> 
> Are you using the 4.1.2 release? Did you configure with
> --enable-parallel-tests, and did those tests pass?

I am using 4.1.3, configured with "--enable-parallel-tests", and those tests
passed.
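
For anyone following along, a rough minimal sketch of the kind of parallel
write involved (the file name, sizes, and variable name below are only
illustrative, not from our application; depending on the netCDF version the
parallel declarations live in netcdf_par.h):

#include <stdlib.h>
#include <mpi.h>
#include <netcdf.h>
#include <netcdf_par.h>

#define NX 1024
#define NY 1024

int main(int argc, char **argv)
{
    int rank, nprocs, ncid, dimids[2], varid;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Create a netCDF-4/HDF5 file for parallel access with the MPI-IO
     * driver.  Error checking is omitted for brevity. */
    nc_create_par("example.nc", NC_NETCDF4 | NC_MPIIO,
                  MPI_COMM_WORLD, MPI_INFO_NULL, &ncid);

    nc_def_dim(ncid, "x", NX, &dimids[0]);
    nc_def_dim(ncid, "y", NY, &dimids[1]);
    nc_def_var(ncid, "data", NC_FLOAT, 2, dimids, &varid);
    nc_enddef(ncid);

    /* Each rank writes its own contiguous block of rows (assumes nprocs
     * divides NX evenly). */
    size_t rows = NX / nprocs;
    size_t start[2] = {rank * rows, 0};
    size_t count[2] = {rows, NY};
    float *buf = malloc(rows * NY * sizeof(float));
    for (size_t i = 0; i < rows * NY; i++)
        buf[i] = (float)rank;

    nc_put_vara_float(ncid, varid, start, count, buf);

    nc_close(ncid);
    free(buf);
    MPI_Finalize();
    return 0;
}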

> 
> I would suggest building netCDF with --enable-parallel-tests and then
> running nc_test4/tst_nc4perf. This simple program, based on
> user-contributed test code, performs parallel I/O with a wide variety of
> options, and prints a table of results.

I have run nc_test4/tst_nc4perf for 1, 2, 4, and 8 processors; the results are
attached.

To me, the performance decreases as the number of processors increases.
Someone may have a better interpretation.

I also ran tst_parallel4, with these results:
num_proc   time(s)  write_rate(B/s)
1       9.2015  1.16692e+08
2       12.4557 8.62048e+07
4       6.30644 1.70261e+08
8       5.53761 1.939e+08
16      2.25639 4.75866e+08
32      2.28383 4.7015e+08
64      2.19041 4.90202e+08

> 
> This will tell you whether parallel I/O is working on your platform, and
> at least give some idea of reasonable settings.
> 
> Parallel I/O is a very complex topic. However, if everything is working
> well, you should see I/O improvement which scales reasonably linearly,
> for less than about 8 processors (perhaps more, depending on your
> system, but not much more.) At this point, your parallel application is
> saturating your I/O subsystem, and further I/O performance is
> marginal.
> 
> In general, HDF5 I/O will not be faster than netCDF-4 I/O. The netCDF-4
> layer is very light in this area, and simply calls the HDF5 that the
> user would call anyway.
> 
> Key settings are: 
> 
> * MPI_IO vs. POSIX_IO (varies from platform to platform which is
>  faster. See nc4perf results for your machine/compiler.)

    We tested both; POSIX was better.
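
    For reference, the driver is selected by the create-mode flag, roughly as
    in this sketch (NC_MPIPOSIX is the MPI-POSIX driver flag available in the
    4.1.x releases):

#include <mpi.h>
#include <netcdf.h>
#include <netcdf_par.h>

/* Sketch: pick the underlying I/O driver via the netCDF create-mode flag.
 * NC_MPIIO selects the MPI-IO driver, NC_MPIPOSIX the MPI-POSIX driver. */
int create_with_driver(const char *path, int use_posix, int *ncidp)
{
    int cmode = NC_NETCDF4 | (use_posix ? NC_MPIPOSIX : NC_MPIIO);
    return nc_create_par(path, cmode, MPI_COMM_WORLD, MPI_INFO_NULL, ncidp);
}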
> 
> * Chunking and caching play a big role, as always. Caching is
>  turned off by default, otherwise netCDF caches on all the processors
>  will consume too much memory. But you should set this to at least the
>  size of one chunk. Note that this cache will happen on all processors
>  involved.
> 

   We use chunking; we can probably try caching as well.
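
   If we do try caching, presumably it would look something like this sketch
   (the chunk sizes and cache parameters below are arbitrary placeholders,
   not tuned values):

#include <netcdf.h>

/* Sketch: set explicit chunk sizes for a 2-D variable (must be done in
 * define mode), then give that variable a chunk cache large enough to
 * hold at least one chunk. */
int tune_chunking_and_cache(int ncid, int varid)
{
    int status;
    size_t chunks[2] = {256, 1024};   /* placeholder chunk shape */

    if ((status = nc_def_var_chunking(ncid, varid, NC_CHUNKED, chunks)))
        return status;

    /* Cache size in bytes, number of slots, preemption (0.0 - 1.0). */
    return nc_set_var_chunk_cache(ncid, varid, 32 * 1024 * 1024, 1009, 0.75f);
}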

> * Collective vs. independent access. Seems (to my naive view) like
>  independent should usually be faster, but the opposite seems to be
>  the case. This is because the I/O subsystems are good at grouping I/O
>  requests into larger, more efficient units. Collective access gives
>  the I/O layer the maximum chance to exercise its magic.

    We tried both; there was no significant difference.
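
    For completeness, switching between the two modes is a single call per
    variable, roughly as in this sketch (NC_INDEPENDENT is the default
    access mode):

#include <netcdf.h>
#include <netcdf_par.h>

/* Sketch: choose collective or independent parallel access for one
 * variable before the subsequent put/get calls. */
int set_par_access(int ncid, int varid, int collective)
{
    return nc_var_par_access(ncid, varid,
                             collective ? NC_COLLECTIVE : NC_INDEPENDENT);
}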

> 
> Best thing to do is get tst_nc4perf working on your platform, and then
> modify it to write data files that match yours (i.e. same size
> variables). The program will then tell you the best set of settings to
> use in your case.

   We can modify this program to mimic our data size, but we do not know
whether this will help us.

> 
> If the program shows that parallel I/O is not working, take a look at
> the netCDF test program h5_test/tst_h_par.c. This is a HDF5-only program
> (no netcdf code at all) that does parallel I/O. If this program does not
> show that parallel I/O is working, then your problem is not with the
> netCDF layer, but somewhere in HDF5 or even lower in the stack.
> 
> Thanks!
> 
> Ed
> 
> -- 
> Ed Hartnett  -- ed@xxxxxxxxxxxxxxxx

Attachment: mpi_io_bluefire.perf
Description: Binary data
