Re: [netcdfgroup] random read failures with large CF-2 files (on Lustre?)

  • To: Rob Latham <robl@xxxxxxxxxxx>
  • Subject: Re: [netcdfgroup] random read failures with large CF-2 files (on Lustre?)
  • From: Gerry Creager - NOAA Affiliate <gerry.creager@xxxxxxxx>
  • Date: Wed, 19 Aug 2015 20:55:02 +0000
I'll open a case to determine if Cray's MPI-IO library has this problem.

gerry

On Wed, Aug 19, 2015 at 7:47 PM, Rob Latham <robl@xxxxxxxxxxx> wrote:

>
>
> On 08/18/2015 02:31 PM, Ward Fisher wrote:
>
>> Hello all,
>>
>> I just wanted to jump in and comment that this issue, recently reported
>> to us by David Knaak at Cray, is now handled in the netCDF-C development
>> branch on GitHub. This fix will be in the upcoming release candidate and
>> eventual final release of netCDF-C 4.4.0.
>>
>> Regarding the question of short reads providing more warning: netcdf
>> was already checking for short reads when ‘paging in’ data from a
>> file, but it assumed an error whenever one occurred (due to a
>> non-zero |errno| value). The fix shouldn’t incur any performance
>> penalty. The new thing I learned about “short reads” is that they
>> can occur /without/ being the result of an error, but rather as the
>> result of an interrupt.
>>
>
> I found these short reads would happen in ROMIO when trying to read 2 GiB
> of data in one shot.  Linux would give me back (2GiB-4k) worth of data.
>
> Today, most MPI-IO libraries should detect and retry this case.  Cray's
> MPI-IO library is closed source, so I don't know what they do.
>
>> In general, since they are technically allowed, I think developers are
>> going to have to accommodate the possibility of short reads in their
>> software, one way or another. Developers should already be checking the
>> return value of |read()|, and when short, the fix is essentially:
>>
>>  1. Check to see if errno is |EINTR|
>>  2. If so, perform some calculations and resume the read.
>>
>
> While that's strictly correct, I worry about short reads that for whatever
> reason don't set EINTR.  So I would check how much data was read.  If it is
> less than requested, continue the read to fetch the missing data.  If that
> continued read returns 0, then you are at EOF and you are done.
>
> ==rob
>
> --
> Rob Latham
> Mathematics and Computer Science Division
> Argonne National Lab, IL USA
>
>
> _______________________________________________
> netcdfgroup mailing list
> netcdfgroup@xxxxxxxxxxxxxxxx
> For list information or to unsubscribe,  visit:
> http://www.unidata.ucar.edu/mailing_lists/
>
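
For illustration, the retry logic discussed in the quoted messages above can be sketched in C roughly as follows. This is a minimal sketch under stated assumptions, not the actual netCDF-C fix or what any MPI-IO library does: `read_fully` is a hypothetical helper name, and it assumes a blocking POSIX file descriptor. It retries on EINTR and also continues after any short read, stopping only when `read()` returns 0 (EOF) or a real error.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper, not part of netCDF-C or any MPI-IO library.
 * Reads up to `count` bytes from `fd` into `buf`, continuing after
 * short reads and after EINTR. Returns the number of bytes read
 * (less than `count` only if EOF was reached), or -1 on a real
 * error, with errno set by read(). */
ssize_t read_fully(int fd, void *buf, size_t count)
{
    assert(buf != NULL || count == 0);

    char *p = buf;
    size_t total = 0;

    while (total < count) {
        ssize_t n = read(fd, p + total, count - total);
        if (n > 0) {
            total += (size_t)n;   /* possibly short: loop to fetch the rest */
        } else if (n == 0) {
            break;                /* EOF: nothing more to read */
        } else if (errno != EINTR) {
            return -1;            /* genuine error */
        }
        /* n < 0 with errno == EINTR: interrupted before any data, retry */
    }
    return (ssize_t)total;
}
```

A caller would compare the return value against the requested size: a smaller non-negative value means the file ended early, while -1 means `errno` describes a real failure.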



-- 
Gerry Creager
NSSL/CIMMS
405.325.6371
++++++++++++++++++++++
“Big whorls have little whorls,
That feed on their velocity;
And little whorls have lesser whorls,
And so on to viscosity.”
Lewis Fry Richardson (1881-1953)