Re: [netcdfgroup] bug report (nc_put_varm_double hangs during collective parallel I/O): a follow-up

  • To: Dennis Heimbigner <dmh@xxxxxxxxxxxxxxxx>
  • Subject: Re: [netcdfgroup] bug report (nc_put_varm_double hangs during collective parallel I/O): a follow-up
  • From: Constantine Khroulev <ckhroulev@xxxxxxxxxx>
  • Date: Tue, 26 Feb 2013 08:08:46 -0900
Dennis and others:

Let me know if I can help.

It was a long time ago, but I'm pretty sure I still understand what
was going on there.

-- 
Constantine

On Tue, Feb 26, 2013 at 7:08 AM, Rob Latham <robl@xxxxxxxxxxx> wrote:
> On Fri, Feb 22, 2013 at 01:45:44PM -0700, Dennis Heimbigner wrote:
>> I recently rewrote nc_get/put_vars to no longer
>> use varm, so it may be time to revisit this issue.
>> What confuses me is that in fact, the varm code
>> writes one instance of the variable on each pass
>> (e.g. v[0], v[1], v[2],...). So I am not sure
>> how it is ever not writing the same number of times on all processors.
>> Can the original person (Rob?) give me more details?
>
> I'm not the original person.  Constantine Khroulev provided a nice
> testcase last January (netcdf_parallel_2d.c).
>
> I just pulled netcdf4 from SVN (r2999) and built it with hdf5-1.8.10.
>
> Constantine Khroulev's test case hangs (though in a different place
> than a year ago...):
>
> Today, that test case hangs with one process in a testcase-level
> barrier (netcdf_parallel_2d.c:134) and one process stuck in
> nc4_enddef_netcdf4_file trying to flush data.
>
> This test case demonstrates the problem nicely.  Take a peek at it and
> double-check the testcase is correct, but you've got a nice driver to
> find and fix this bug.
>
> ==rob
>
>>
>> =Dennis Heimbigner
>>  Unidata
>>
>> Orion Poplawski wrote:
>> >On 01/27/2012 01:22 PM, Rob Latham wrote:
>> >>On Wed, Jan 25, 2012 at 10:06:59PM -0900, Constantine Khroulev wrote:
>> >>>Hello NetCDF developers,
>> >>>
>> >>>My apologies to list subscribers not interested in these (very)
>> >>>technical details.
>> >>
>> >>I'm interested!   I hope you send more of these kinds of reports.
>> >>
>> >>>When the collective parallel access mode is selected, all processors in
>> >>>a communicator have to call H5Dread() (or H5Dwrite()) the same number
>> >>>of times.
>> >>>
>> >>>In nc_put_varm_*, NetCDF breaks data into contiguous segments that can
>> >>>be written one at a time (see NCDEFAULT_get_varm(...) in
>> >>>libdispatch/var.c, lines 479 and on). In some cases the number of
>> >>>these segments varies from one processor to the next.
>> >>>
>> >>>As a result, as soon as one of the processors in a communicator is done
>> >>>writing its data, the program locks up, because now only a subset of
>> >>>processors in this communicator are calling H5Dwrite(). (Unless all
>> >>>processors have the same number of "data segments" to write, that is.)
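
To make that failure mode concrete, here is a minimal sketch (not the
original netcdf_parallel_2d.c test case; the file name, variable name,
and decomposition below are made up for illustration). Every rank
selects collective access on the same variable, but rank 0 splits its
write into two nc_put_vara_double() calls while the other ranks use
one, so rank 0's second collective call has no partner and the run is
expected to lock up, just as it does when the varm code produces a
different number of contiguous segments on different ranks.

/* Minimal sketch of the unequal-collective-call deadlock.
 * Build (illustrative): mpicc deadlock_sketch.c -o deadlock_sketch -lnetcdf
 * Run with two or more ranks: mpiexec -n 2 ./deadlock_sketch */
#include <mpi.h>
#include <netcdf.h>
#include <netcdf_par.h>
#include <stdio.h>

static void check(int status, int line)
{
    if (status != NC_NOERR) {
        fprintf(stderr, "line %d: %s\n", line, nc_strerror(status));
        MPI_Abort(MPI_COMM_WORLD, 1);
    }
}

int main(int argc, char **argv)
{
    int rank, size, ncid, dimid, varid;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    check(nc_create_par("deadlock_sketch.nc", NC_NETCDF4 | NC_MPIIO,
                        MPI_COMM_WORLD, MPI_INFO_NULL, &ncid), __LINE__);
    check(nc_def_dim(ncid, "x", (size_t)(2 * size), &dimid), __LINE__);
    check(nc_def_var(ncid, "v", NC_DOUBLE, 1, &dimid, &varid), __LINE__);
    check(nc_var_par_access(ncid, varid, NC_COLLECTIVE), __LINE__);
    check(nc_enddef(ncid), __LINE__);

    /* Rank 0 writes its two values in two separate collective calls;
     * every other rank writes its two values in a single call.  Rank 0's
     * second call has no matching collective write on the other ranks,
     * so, as described above, only a subset of the communicator ends up
     * inside H5Dwrite() and the program is expected to lock up. */
    double data[2] = {rank, rank};
    size_t start = 2 * (size_t)rank, count;
    if (rank == 0) {
        count = 1;
        check(nc_put_vara_double(ncid, varid, &start, &count, &data[0]), __LINE__);
        start += 1;
        check(nc_put_vara_double(ncid, varid, &start, &count, &data[1]), __LINE__);
    } else {
        count = 2;
        check(nc_put_vara_double(ncid, varid, &start, &count, data), __LINE__);
    }

    check(nc_close(ncid), __LINE__);
    MPI_Finalize();
    return 0;
}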
>> >>
>> >>Oh, that's definitely a bug.  netcdf4 should call something like
>> >>MPI_Allreduce with MPI_MAX to figure out how many "rounds" of I/O will
>> >>be done (this is what we do inside ROMIO, for a slightly different
>> >>reason)
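
As a sketch of that approach (not the actual ROMIO or netCDF-4 code;
write_segment() below is a hypothetical stand-in for the per-segment
write inside the varm loop): each rank counts its local segments,
MPI_Allreduce with MPI_MAX fixes the number of rounds everyone will
perform, and ranks that run out of real segments keep participating
with zero-length accesses so every collective call is matched.

#include <mpi.h>
#include <stdio.h>

/* Hypothetical stand-in for the per-segment H5Dwrite/nc_put_vara call
 * inside the varm loop; here it only reports what it would do. */
static int write_segment(int seg, int empty)
{
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("rank %d, round %d: %s write\n", rank, seg,
           empty ? "zero-length" : "real");
    return 0;
}

static int collective_varm_write(MPI_Comm comm, int my_nsegments)
{
    int max_nsegments = 0;

    /* Every rank learns the largest per-rank segment count. */
    MPI_Allreduce(&my_nsegments, &max_nsegments, 1, MPI_INT, MPI_MAX, comm);

    /* Everyone performs the same number of rounds; ranks that have run
     * out of real segments contribute an empty selection so each
     * collective call is matched on all processors. */
    for (int round = 0; round < max_nsegments; round++) {
        int status = write_segment(round, round >= my_nsegments);
        if (status != 0)
            return status;
    }
    return 0;
}

int main(int argc, char **argv)
{
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Pretend each rank has a different number of contiguous segments,
     * as happens with mapped (varm) writes on uneven decompositions. */
    collective_varm_write(MPI_COMM_WORLD, rank + 1);

    MPI_Finalize();
    return 0;
}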
>> >>
>> >>>But here's the thing: I'm not sure this is worth fixing. The only
>> >>>reason to use collective I/O I can think of is for better performance,
>> >>>and then avoiding sub-sampled and mapped reading and writing is a good
>> >>>idea anyway.
>> >>
>> >>well, if varm and vars are the natural way to access the data, then
>> >>the library should do what it can to do that efficiently.   The fix
>> >>appears to be straightforward.  Collective I/O has a lot of advantages
>> >>on some platforms: it will automatically select a subset of processors
>> >>or automatically construct a file access most closely suited to the
>> >>underlying file system.
>> >>
>> >>==rob
>> >>
>> >
>> >Was this ever fixed?
>> >
>
> --
> Rob Latham
> Mathematics and Computer Science Division
> Argonne National Lab, IL USA
>
> _______________________________________________
> netcdfgroup mailing list
> netcdfgroup@xxxxxxxxxxxxxxxx
> For list information or to unsubscribe,  visit: 
> http://www.unidata.ucar.edu/mailing_lists/


