Re: [netcdfgroup] NetCDF4 Parallel-enabled IO performance issue

  • To: Rob Latham <robl@xxxxxxxxxxx>
  • Subject: Re: [netcdfgroup] NetCDF4 Parallel-enabled IO performance issue
  • From: Wei Huang <huangwei@xxxxxxxx>
  • Date: Mon, 19 Sep 2011 11:44:36 -0600
Rob,

Below are the results using pnetcdf (but under netcdf4).

Thanks,

Wei Huang
huangwei@xxxxxxxx
VETS/CISL
National Center for Atmospheric Research
P.O. Box 3000 (1850 Table Mesa Dr.)
Boulder, CO 80307-3000 USA
(303) 497-8924



Number of Processors    Total(seconds)    Read(seconds)    Write(seconds)    Computation(seconds)
seq                     89.137            28.206           48.327            11.717
1                       89.055            18.190           58.977            11.612
2                       189.892           14.577           168.999           5.729
4                       229.825           24.265           202.11            2.585
8                       263.488           26.528           234.199           1.130
16                      298.131           48.399           247.07            0.625
32                      421.336           63.559           352.373           0.484
64                      549.144           71.947           462.465           0.525
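
A minimal sketch of how such a file can be opened for parallel access in the modes discussed in this thread: the HDF5 backend over MPI-IO or the MPI-POSIX driver, or the pnetcdf library for classic-format files. The flag names are those of the netCDF 4.1-era C API; the file and variable names are placeholders, and error checking is omitted.

    #include <mpi.h>
    #include <netcdf.h>
    #include <netcdf_par.h>   /* nc_open_par(), NC_MPIIO, NC_MPIPOSIX, NC_PNETCDF */

    int main(int argc, char **argv)
    {
        int ncid, varid;
        MPI_Init(&argc, &argv);

        /* HDF5 backend over MPI-IO */
        nc_open_par("history.nc", NC_MPIIO, MPI_COMM_WORLD, MPI_INFO_NULL, &ncid);

        /* ...or over the HDF5 MPI-POSIX driver, the mode reported above:
           nc_open_par("history.nc", NC_MPIPOSIX, MPI_COMM_WORLD,
                       MPI_INFO_NULL, &ncid);                                  */

        /* ...or, for a classic-format (CDF-1/2) file, through the pnetcdf library:
           nc_open_par("history.nc", NC_PNETCDF, MPI_COMM_WORLD,
                       MPI_INFO_NULL, &ncid);                                  */

        /* collective access lets MPI-IO aggregate the record-variable reads */
        nc_inq_varid(ncid, "T", &varid);        /* "T" is a placeholder name */
        nc_var_par_access(ncid, varid, NC_COLLECTIVE);

        nc_close(ncid);
        MPI_Finalize();
        return 0;
    }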


On Sep 19, 2011, at 11:36 AM, Rob Latham wrote:

> On Mon, Sep 19, 2011 at 11:09:23AM -0600, Wei Huang wrote:
>> Jim,
>> 
>> I am using the gpfs filesystem, but did not set any MPI-IO hints.
>> I did not do processor binding, but I guess binding could help if
>> fewer processors are used on a node.
>> I am actually using NC_MPIPOSIX rather than NC_MPIIO, as the latter gives
>> even worse timing.
>> 
>> The 5G file has 170 variables, some of them with dimensions
>> [ 1 <time | unlimited>, 27 <ilev>, 768 <lat>, 1152 <lon> ]
>> and chunk sizes (1, 1, 192, 288).
>> 
>> The last part seems more like a job for the netCDF developers.
> 
> Perhaps you can make the netcdf developers' job a bit easier by
> providing a test case.  If the dataset contains 170 variables, then it
> must be part of some larger program and so might be hard to extract.
> 
> I'll be honest: I'm mostly curious how pnetcdf handles this workload
> (my guess as a pnetcdf developer is "poorly" because of the record
> variable i/o).  Still, the test case will help the netcdf, hdf5, and
> MPI-IO developers...
> 
> ==rob
> 
>> On Sep 19, 2011, at 10:48 AM, Jim Edwards wrote:
>> 
>>> Hi Wei,
>>> 
>>> 
>>> Are you using the gpfs filesystem and are you setting any MPI-IO hints for 
>>> that filesystem?
>>> 
>>> Are you using any processor binding technique?   Have you experimented with 
>>> other settings?
>>> 
>>> You stated that the file is 5G but what is the size of a single field and 
>>> how is it distributed?  In other words, is it already aggregated into a nice 
>>> blocksize or are you expecting netcdf/MPI-IO to handle that?
>>> 
>>> I think that in order to really get a good idea of where the performance 
>>> problem might be, you need to start by writing and timing a binary file of 
>>> roughly equivalent size, then write an hdf5 file, then write a netcdf4 
>>> file.    My guess is that you will find that the performance problem is 
>>> lower on the tree...
>>> 
>>> - Jim
>>> 
>>> On Mon, Sep 19, 2011 at 10:28 AM, Wei Huang <huangwei@xxxxxxxx> wrote:
>>> Hi, netcdfgroup,
>>> 
>>> Currently, we are trying to use parallel-enabled NetCDF4. We started by
>>> reading/writing a 5G file with some computation, and got the following
>>> timings (wall-clock seconds) on an IBM Power machine:
>>> Number of Processors    Total(seconds)    Read(seconds)    Write(seconds)    Computation(seconds)
>>> seq                     89.137            28.206           48.327            11.717
>>> 1                       178.953           44.837           121.17            11.644
>>> 2                       167.25            46.571           113.343           5.648
>>> 4                       168.138           44.043           118.968           2.729
>>> 8                       137.74            25.161           108.986           1.064
>>> 16                      113.354           16.359           93.253            0.494
>>> 32                      439.481           122.201          311.215           0.274
>>> 64                      831.896           277.363          588.653           0.203
>>> 
>>> The first thing we can see is that when the parallel-enabled code is run on
>>> one processor, the total wall-clock time doubles.
>>> We also do not see any scaling as more processors are added.
>>> 
>>> Anyone wants to share their experience?
>>> 
>>> Thanks,
>>> 
>>> Wei Huang
>>> huangwei@xxxxxxxx
>>> VETS/CISL
>>> National Center for Atmospheric Research
>>> P.O. Box 3000 (1850 Table Mesa Dr.)
>>> Boulder, CO 80307-3000 USA
>>> (303) 497-8924
>>> 
>>> 
>>> 
>>> _______________________________________________
>>> netcdfgroup mailing list
>>> netcdfgroup@xxxxxxxxxxxxxxxx
>>> For list information or to unsubscribe,  visit: 
>>> http://www.unidata.ucar.edu/mailing_lists/
>>> 
>> 
> 
>> _______________________________________________
>> netcdfgroup mailing list
>> netcdfgroup@xxxxxxxxxxxxxxxx
>> For list information or to unsubscribe,  visit: 
>> http://www.unidata.ucar.edu/mailing_lists/ 
> 
> 
> -- 
> Rob Latham
> Mathematics and Computer Science Division
> Argonne National Lab, IL USA
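
The exchange above also touches on two tuning knobs: the chunk layout of the record variables and MPI-IO hints for GPFS. A hedged sketch of both follows, using the dimensions and chunk sizes quoted earlier (1 x 27 x 768 x 1152 with chunks 1 x 1 x 192 x 288). The variable name is a placeholder, the hint names are ROMIO/IBM-MPI specific and vary by MPI implementation, and error checking is omitted.

    #include <mpi.h>
    #include <netcdf.h>
    #include <netcdf_par.h>

    int main(int argc, char **argv)
    {
        int ncid, varid, dimids[4];
        size_t chunks[4] = {1, 1, 192, 288};   /* chunk layout quoted in the thread */
        MPI_Info info;

        MPI_Init(&argc, &argv);

        /* GPFS-oriented MPI-IO hints (names depend on the MPI implementation) */
        MPI_Info_create(&info);
        MPI_Info_set(info, "IBM_largeblock_io", "true");   /* IBM MPI on GPFS */
        MPI_Info_set(info, "cb_buffer_size", "16777216");  /* ROMIO collective buffering */

        nc_create_par("out.nc", NC_NETCDF4 | NC_MPIIO, MPI_COMM_WORLD, info, &ncid);

        nc_def_dim(ncid, "time", NC_UNLIMITED, &dimids[0]);
        nc_def_dim(ncid, "ilev", 27,   &dimids[1]);
        nc_def_dim(ncid, "lat",  768,  &dimids[2]);
        nc_def_dim(ncid, "lon",  1152, &dimids[3]);
        nc_def_var(ncid, "T", NC_FLOAT, 4, dimids, &varid);   /* placeholder variable */

        /* chunks that line up with each process's sub-domain avoid having
           many processes touch the same HDF5 chunk on every record write */
        nc_def_var_chunking(ncid, varid, NC_CHUNKED, chunks);
        nc_enddef(ncid);

        nc_var_par_access(ncid, varid, NC_COLLECTIVE);
        /* ... each process writes its sub-domain with nc_put_vara_float() ... */

        nc_close(ncid);
        MPI_Info_free(&info);
        MPI_Finalize();
        return 0;
    }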


