On 10/18/2014 04:39 AM, Samrat Rao wrote:
Hi Rob & Ed,

I think the machine I am using is not that bad. It was commissioned in '12. Some basic info:

Performance:
- 360 TFLOPS peak & 304 TFLOPS sustained on LINPACK

Hardware:
- HP blade system C7000 with BL460c Gen8 blades
- 1088 nodes with 300 GB disk/node (319 TB)
- 2,176 Intel Xeon E5-2670 processors @ 2.6 GHz
- 17,408 processor cores, 68 TB main memory
- FDR InfiniBand based fully non-blocking fat-tree topology
- 2 PB high-performance storage with the Lustre parallel file system
OK, then let's work up the software stack. You've got a Lustre file system, so you're going to need a halfway decent MPI-IO implementation. Good news: OpenMPI, MPICH, and MVAPICH all have good Lustre drivers. Please ensure you are running something close to the latest version. (Sometimes we find users -- somehow -- running ten-year-old MPICH code.)
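For example, here is a minimal sketch of passing Lustre striping hints down to the MPI-IO layer through an MPI_Info object. The hint names are standard ROMIO hints; the values are only illustrative and should be tuned for your system:

  program lustre_hints
    use mpi
    implicit none
    integer :: info, ierr

    call MPI_Init(ierr)
    ! Build an info object carrying Lustre striping hints for the MPI-IO layer.
    call MPI_Info_create(info, ierr)
    ! Stripe each file across 32 OSTs (illustrative value).
    call MPI_Info_set(info, "striping_factor", "32", ierr)
    ! Use a 1 MiB stripe size (illustrative value).
    call MPI_Info_set(info, "striping_unit", "1048576", ierr)
    ! The info object would then be passed at file-creation time, e.g. to
    ! MPI_File_open or to the netCDF/pnetcdf create calls shown later in
    ! this thread.
    call MPI_Info_free(info, ierr)
    call MPI_Finalize(ierr)
  end program lustre_hints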
You need a recent-ish HDF5 library to make full use of the MPI-IO library, and you need the very latest netCDF library for assorted bug fixes (and compatibility with the latest HDF5 library).
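A quick way to confirm which netCDF library an executable is actually linked against is to print the version string (a minimal sketch; nf90_inq_libvers is part of the netCDF Fortran API):

  program check_version
    use netcdf
    implicit none
    ! Print the version of the netCDF library this executable is linked against.
    print *, 'netCDF library version: ', trim(nf90_inq_libvers())
  end program check_version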
Debugging this stack over the mailing list is a bit of a challenge. ==rob
----
Using netCDF configured for parallel applications, I did manage to write data to a single netCDF file using 512 procs -- but this was when I reduced the grid nodes per proc to about 20-30. When I raised the grid nodes per proc to about 100, I got this error too:

NetCDF: HDF error

----
There is another issue I need to share: while compiling netCDF-4 for parallel usage, I encountered errors during 'make check' in these files: run_par_test.sh, run_f77_par_test.sh and run_f90_par_test.sh. These were related to the mpiexec commands -- an mpd.hosts issue. These errors did not occur when I compiled netCDF for parallel use on my desktop.

----
Dumping outputs from each processor gave me these errors. It is not that all such errors appear together -- they are a bit random:

[proxy:0:13@cn0083] HYDT_bscu_wait_for_completion (./tools/bootstrap/utils/bscu_wait.c:73): one of the processes terminated badly; aborting
[proxy:0:13@cn0083] HYDT_bsci_wait_for_completion (./tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
[proxy:0:13@cn0083] HYD_pmci_wait_for_childs_completion (./pm/pmiserv/pmip_utils.c:1476): bootstrap server returned error waiting for completion
[proxy:0:13@cn0083] main (./pm/pmiserv/pmip.c:392): error waiting for event children completion
[mpiexec@cn0002] control_cb (./pm/pmiserv/pmiserv_cb.c:674): assert (!closed) failed
[mpiexec@cn0002] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[mpiexec@cn0002] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:388): error waiting for event
[mpiexec@cn0002] main (./ui/mpich/mpiexec.c:718): process manager error waiting for completion
cn0137:b279:beba2700: 132021042 us(132021042 us!!!): ACCEPT_RTU: rcv ERR, rcnt=0 op=1 <- 10.1.1.136
cn1068:48c5:4b280700: 132013538 us(132013538 us!!!): ACCEPT_RTU: rcv ERR, rcnt=-1 op=1 <- 10.1.5.47
cn1075:dba3:f8d7700: 132099675 us(132099675 us!!!): CONN_REQUEST: SOCKOPT ERR Connection refused -> 10.1.1.51 16193 - RETRYING... 5
cn1075:dba3:f8d7700: 132099826 us(151 us): CONN_REQUEST: SOCKOPT ERR Connection refused -> 10.1.1.51 16193 - RETRYING...4
cn1075:dba3:f8d7700: 132099942 us(116 us): CONN_REQUEST: SOCKOPT ERR Connection refused -> 10.1.1.51 16193 - RETRYING...3
cn1075:dba3:f8d7700: 132100049 us(107 us): CONN_REQUEST: SOCKOPT ERR Connection refused -> 10.1.1.51 16193 - RETRYING...2
cn1075:dba3:f8d7700: 132100155 us(106 us): CONN_REQUEST: SOCKOPT ERR Connection refused -> 10.1.1.51 16193 - RETRYING...1
cn1075:dba3:f8d7700: 132100172 us(17 us): dapl_evd_conn_cb() unknown event 0x0

----
Rob, I guess I will need to look into the I/O methods you listed.

Thanks for your time,
Samrat.

On Fri, Oct 17, 2014 at 10:00 PM, Rob Latham <robl@xxxxxxxxxxx> wrote:

On 10/17/2014 11:25 AM, Ed Hartnett wrote:

> Unless things have changed since my day, it is possible to read pnetcdf
> files with the netCDF library. It must be built with --enable-pnetcdf and
> --with-pnetcdf=/some/location, IIRC.

Ed! In this case, Samrat Rao was using pnetcdf to create CDF-5 (giant variable) formatted files. To refresh your memory: Argonne and Northwestern developed this file format with UCAR's signoff, with the understanding that we (ANL and NWU) would never expect UCAR to add support for it unless we did the work. I took a stab at it a few years back, and Wei-keng is taking a second crack at it right now.

The classic file formats CDF-1 and CDF-2 are fully interoperable between pnetcdf and netcdf.

==rob
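For reference, a minimal sketch of the collective netCDF-4 write pattern discussed above. The file name, grid sizes, chunk sizes, and 1-D decomposition are all illustrative; it assumes the global y-extent divides evenly among ranks and a netCDF-4/HDF5 stack built with parallel support, and error checking of the ierr returns is omitted for brevity:

  program par_write_sketch
    use mpi
    use netcdf
    implicit none
    integer, parameter :: NX = 1024, NY = 1024   ! illustrative global grid
    integer :: ierr, rank, nprocs, ncid, varid, dimids(2)
    integer :: start(2), count(2)
    real, allocatable :: u(:,:)

    call MPI_Init(ierr)
    call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
    call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

    ! All ranks collectively create one shared file.
    ierr = nf90_create('u.nc', ior(NF90_NETCDF4, NF90_MPIIO), ncid, &
                       comm=MPI_COMM_WORLD, info=MPI_INFO_NULL)

    ierr = nf90_def_dim(ncid, 'x', NX, dimids(1))
    ierr = nf90_def_dim(ncid, 'y', NY, dimids(2))
    ierr = nf90_def_var(ncid, 'u', NF90_REAL, dimids, varid)
    ! Choose chunk sizes explicitly rather than relying on library defaults
    ! (illustrative values: one chunk per rank's slab).
    ierr = nf90_def_var_chunking(ncid, varid, NF90_CHUNKED, (/ NX, NY/nprocs /))
    ierr = nf90_enddef(ncid)

    ! Collective access generally performs best on Lustre.
    ierr = nf90_var_par_access(ncid, varid, NF90_COLLECTIVE)

    ! 1-D decomposition along y: each rank writes its own slab.
    count = (/ NX, NY/nprocs /)
    start = (/ 1, rank*(NY/nprocs) + 1 /)
    allocate(u(count(1), count(2)))
    u = real(rank)
    ierr = nf90_put_var(ncid, varid, u, start=start, count=count)

    ierr = nf90_close(ncid)
    call MPI_Finalize(ierr)
  end program par_write_sketch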
On Fri, Oct 17, 2014 at 6:33 AM, Samrat Rao <samrat.rao@xxxxxxxxx> wrote:

Hi,

I'm sorry for the late reply. I have no classic/netcdf-3 datasets -- the datasets are yet to be generated. All my codes are also new. Initially I tried pnetcdf and wrote a few variables, but found that the format was CDF-5, which 'normal' netCDF would not read. I also need to read some bits of netCDF data in Matlab, so I thought of sticking to the usual netCDF-4 compiled for parallel I/O. It is also likely that I will have to share my workload with others in my group and/or leave the code for future people to work on. Does Matlab read CDF-5 files? So I preferred the usual netCDF.

Rob, I hope you are not annoyed, but most of the above is for another day. Currently I am stuck elsewhere.

With a smaller number of processors, 216, the single netCDF file gets created (I create a single netCDF file for each variable), but for anything above that I get these errors:

NetCDF: Bad chunk sizes

Not sure where these errors come from.

Then I shifted to dumping outputs from each processor in simple binary -- this works till about 1500 processors. Above this number the code gets stuck and eventually aborts. This issue is not new: my colleague too had problems running his code on 1500+ procs. Today I came to know that opening a large number of files (each proc writes one file) can overwhelm the system -- solving this requires more than rudimentary writing techniques, or an understanding of the system's inherent parameters/bottlenecks.

So netCDF is probably out of bounds for now -- I will try again once the simple binary write from each processor gets sorted out. Does anyone have any suggestions?

Thanks,
Samrat.

On Thu, Oct 2, 2014 at 7:52 PM, Rob Latham <robl@xxxxxxxxxxx> wrote:

On 10/02/2014 01:24 AM, Samrat Rao wrote:

> Thanks for your replies. I estimate that I will be requiring approx 4000
> processors and a total grid resolution of 2.5 billion for my F90 code. So I
> need to think/understand which is better -- parallel netCDF or the 'normal' one.

There are a few specific nifty features in pnetcdf that can let you get really good performance, but 'normal' netCDF is a fine choice, too.

> Right now I do not know how to use parallel-netCDF.

It's almost as simple as replacing every 'nf' call with 'nfmpi', but you will be just fine if you stick with UCAR netCDF-4.

> Secondly, I hope that the netCDF-4 files created by either parallel netCDF
> or the 'normal' one are mutually compatible. For analysis I will be
> extracting data using the usual netCDF library, so in case I use
> parallel-netCDF then there should be no inter-compatibility issues.

For truly large variables, parallel-netcdf introduced, with some consultation from the UCAR folks, a 'CDF-5' file format. You have to request it explicitly, and then in that one case you would have a pnetcdf file that netcdf tools would not understand. In all other cases, we work hard to keep pnetcdf and "classic" netcdf compatible.

UCAR NetCDF has the option of an HDF5-based backend -- in fact it's not optional if you want parallel I/O with NetCDF-4 -- but that format is not compatible with parallel-netcdf. By now, your analysis tools surely are updated to understand the new HDF5-based backend? I suppose it's possible you've got some 6-year-old analysis tool that does not understand NetCDF-4's HDF5-based file format.
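For reference, a minimal sketch of the 'nfmpi' create calls in question. The file names are illustrative, NF_64BIT_DATA is the pnetcdf flag that requests the CDF-5 format explicitly, and error checking is omitted for brevity:

  program pnetcdf_sketch
    use mpi
    implicit none
    include 'pnetcdf.inc'
    integer :: ierr, ncid

    call MPI_Init(ierr)
    ! Default create: a classic-format file that the regular netCDF
    ! tools can read.
    ierr = nfmpi_create(MPI_COMM_WORLD, 'classic.nc', NF_CLOBBER, &
                        MPI_INFO_NULL, ncid)
    ierr = nfmpi_enddef(ncid)
    ierr = nfmpi_close(ncid)
    ! CDF-5 must be requested explicitly -- and this is the one case
    ! (in 2014) that the UCAR netCDF tools would not understand.
    ierr = nfmpi_create(MPI_COMM_WORLD, 'big.nc', &
                        IOR(NF_CLOBBER, NF_64BIT_DATA), MPI_INFO_NULL, ncid)
    ierr = nfmpi_enddef(ncid)
    ierr = nfmpi_close(ncid)
    call MPI_Finalize(ierr)
  end program pnetcdf_sketch

Either file's on-disk format can be checked afterwards with 'ncdump -k'.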
Parallel-netcdf would allow you to simulate with parallel i/o and produce a classic netCDF file. But I would be shocked and a little bit angry if that was actually a good reason to use parallel-netcdf in 2014.

==rob
--
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA