Re: [netcdfgroup] NetCDF4 Parallel-enabled IO performance issue

  • To: Rob Latham <robl@xxxxxxxxxxx>
  • Subject: Re: [netcdfgroup] NetCDF4 Parallel-enabled IO performance issue
  • From: Wei Huang <huangwei@xxxxxxxx>
  • Date: Mon, 19 Sep 2011 11:44:36 -0600
Rob,

Below are the results using pnetcdf (but under netcdf4).

Thanks,

Wei Huang
huangwei@xxxxxxxx
VETS/CISL
National Center for Atmospheric Research
P.O. Box 3000 (1850 Table Mesa Dr.)
Boulder, CO 80307-3000 USA
(303) 497-8924



Number of Processors    Total(seconds)    Read(seconds)    Write(seconds)    Computation(seconds)
seq                     89.137            28.206           48.327            11.717
1                       89.055            18.190           58.977            11.612
2                       189.892           14.577           168.999           5.729
4                       229.825           24.265           202.11            2.585
8                       263.488           26.528           234.199           1.130
16                      298.131           48.399           247.07            0.625
32                      421.336           63.559           352.373           0.484
64                      549.144           71.947           462.465           0.525
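
A minimal sketch of how such a file can be opened for parallel access in the modes discussed in this thread: the HDF5 backend over MPI-IO or the MPI-POSIX driver, or the pnetcdf library for classic-format files. The flag names are those of the netCDF 4.1-era C API; the file and variable names are placeholders, and error checking is omitted.

    #include <mpi.h>
    #include <netcdf.h>
    #include <netcdf_par.h>   /* nc_open_par(), NC_MPIIO, NC_MPIPOSIX, NC_PNETCDF */

    int main(int argc, char **argv)
    {
        int ncid, varid;
        MPI_Init(&argc, &argv);

        /* HDF5 backend over MPI-IO */
        nc_open_par("history.nc", NC_MPIIO, MPI_COMM_WORLD, MPI_INFO_NULL, &ncid);

        /* ...or over the HDF5 MPI-POSIX driver, the mode reported above:
           nc_open_par("history.nc", NC_MPIPOSIX, MPI_COMM_WORLD,
                       MPI_INFO_NULL, &ncid);                                  */

        /* ...or, for a classic-format (CDF-1/2) file, through the pnetcdf library:
           nc_open_par("history.nc", NC_PNETCDF, MPI_COMM_WORLD,
                       MPI_INFO_NULL, &ncid);                                  */

        /* collective access lets MPI-IO aggregate the record-variable reads */
        nc_inq_varid(ncid, "T", &varid);        /* "T" is a placeholder name */
        nc_var_par_access(ncid, varid, NC_COLLECTIVE);

        nc_close(ncid);
        MPI_Finalize();
        return 0;
    }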


On Sep 19, 2011, at 11:36 AM, Rob Latham wrote:

> On Mon, Sep 19, 2011 at 11:09:23AM -0600, Wei Huang wrote:
>> Jim,
>> 
>> I am using the gpfs filesystem, but did not set any MPI-IO hints.
>> I did not do processor binding, but I guess binding could help if
>> fewer processors are used on a node.
>> I am actually using NC_MPIPOSIX rather than NC_MPIIO, as the latter gives
>> even worse timing.
>> 
>> The 5G file has 170 variables, some of them with dimensions
>> [ 1 <time | unlimited>, 27 <ilev>, 768 <lat>, 1152 <lon> ]
>> and chunk sizes (1, 1, 192, 288).
>> 
>> The last part seems more like a job for the netCDF developers.
> 
> Perhaps you can make the netcdf developers' job a bit easier by
> providing a test case.  If the dataset contains 170 variables, then it
> must be part of some larger program and so might be hard to extract.
> 
> I'll be honest: I'm mostly curious how pnetcdf handles this workload
> (my guess as a pnetcdf developer is "poorly" because of the record
> variable i/o).  Still, the test case will help the netcdf, hdf5, and
> MPI-IO developers...
> 
> ==rob
> 
>> On Sep 19, 2011, at 10:48 AM, Jim Edwards wrote:
>> 
>>> Hi Wei,
>>> 
>>> 
>>> Are you using the gpfs filesystem and are you setting any MPI-IO hints for 
>>> that filesystem?
>>> 
>>> Are you using any processor binding technique?   Have you experimented with 
>>> other settings?
>>> 
>>> You stated that the file is 5G but what is the size of a single field and 
>>> how is it distributed?  In other words, is it already aggregated into a nice 
>>> blocksize or are you expecting netcdf/MPI-IO to handle that?
>>> 
>>> I think that in order to really get a good idea of where the performance 
>>> problem might be, you need to start by writing and timing a binary file of 
>>> roughly equivalent size, then write an hdf5 file, then write a netcdf4 
>>> file.    My guess is that you will find that the performance problem is 
>>> lower on the tree...
>>> 
>>> - Jim
>>> 
>>> On Mon, Sep 19, 2011 at 10:28 AM, Wei Huang <huangwei@xxxxxxxx> wrote:
>>> Hi, netcdfgroup,
>>> 
>>> Currently, we are trying to use parallel-enabled NetCDF4. We started by
>>> reading/writing a 5G file with some computation, and got the following
>>> timings (wall-clock seconds) on an IBM Power machine:
>>> Number of Processors    Total(seconds)    Read(seconds)    Write(seconds)    Computation(seconds)
>>> seq                     89.137            28.206           48.327            11.717
>>> 1                       178.953           44.837           121.17            11.644
>>> 2                       167.25            46.571           113.343           5.648
>>> 4                       168.138           44.043           118.968           2.729
>>> 8                       137.74            25.161           108.986           1.064
>>> 16                      113.354           16.359           93.253            0.494
>>> 32                      439.481           122.201          311.215           0.274
>>> 64                      831.896           277.363          588.653           0.203
>>> 
>>> The first thing we can see is that when the parallel-enabled code is run on
>>> one processor, the total wall-clock time doubles.
>>> We also do not see any scaling as more processors are added.
>>> 
>>> Anyone wants to share their experience?
>>> 
>>> Thanks,
>>> 
>>> Wei Huang
>>> huangwei@xxxxxxxx
>>> VETS/CISL
>>> National Center for Atmospheric Research
>>> P.O. Box 3000 (1850 Table Mesa Dr.)
>>> Boulder, CO 80307-3000 USA
>>> (303) 497-8924
>>> 
>>> 
>>> 
>>> _______________________________________________
>>> netcdfgroup mailing list
>>> netcdfgroup@xxxxxxxxxxxxxxxx
>>> For list information or to unsubscribe,  visit: 
>>> http://www.unidata.ucar.edu/mailing_lists/
>>> 
>> 
> 
>> _______________________________________________
>> netcdfgroup mailing list
>> netcdfgroup@xxxxxxxxxxxxxxxx
>> For list information or to unsubscribe,  visit: 
>> http://www.unidata.ucar.edu/mailing_lists/ 
> 
> 
> -- 
> Rob Latham
> Mathematics and Computer Science Division
> Argonne National Lab, IL USA
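
The exchange above also touches on two tuning knobs: the chunk layout of the record variables and MPI-IO hints for GPFS. A hedged sketch of both follows, using the dimensions and chunk sizes quoted earlier (1 x 27 x 768 x 1152 with chunks 1 x 1 x 192 x 288). The variable name is a placeholder, the hint names are ROMIO/IBM-MPI specific and vary by MPI implementation, and error checking is omitted.

    #include <mpi.h>
    #include <netcdf.h>
    #include <netcdf_par.h>

    int main(int argc, char **argv)
    {
        int ncid, varid, dimids[4];
        size_t chunks[4] = {1, 1, 192, 288};   /* chunk layout quoted in the thread */
        MPI_Info info;

        MPI_Init(&argc, &argv);

        /* GPFS-oriented MPI-IO hints (names depend on the MPI implementation) */
        MPI_Info_create(&info);
        MPI_Info_set(info, "IBM_largeblock_io", "true");   /* IBM MPI on GPFS */
        MPI_Info_set(info, "cb_buffer_size", "16777216");  /* ROMIO collective buffering */

        nc_create_par("out.nc", NC_NETCDF4 | NC_MPIIO, MPI_COMM_WORLD, info, &ncid);

        nc_def_dim(ncid, "time", NC_UNLIMITED, &dimids[0]);
        nc_def_dim(ncid, "ilev", 27,   &dimids[1]);
        nc_def_dim(ncid, "lat",  768,  &dimids[2]);
        nc_def_dim(ncid, "lon",  1152, &dimids[3]);
        nc_def_var(ncid, "T", NC_FLOAT, 4, dimids, &varid);   /* placeholder variable */

        /* chunks that line up with each process's sub-domain avoid having
           many processes touch the same HDF5 chunk on every record write */
        nc_def_var_chunking(ncid, varid, NC_CHUNKED, chunks);
        nc_enddef(ncid);

        nc_var_par_access(ncid, varid, NC_COLLECTIVE);
        /* ... each process writes its sub-domain with nc_put_vara_float() ... */

        nc_close(ncid);
        MPI_Info_free(&info);
        MPI_Finalize();
        return 0;
    }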


