Hi,

A netCDF bug first reported by Joerg Henrichs and described under "Known Problems with NetCDF Distributions":

  http://www.unidata.ucar.edu/netcdf/docs/known_problems.html#lustre

is apparently more serious than we first thought, although it appears to occur only rarely. The bug is fixed in the upcoming netCDF-4.1.3-beta release, which is currently undergoing final testing. This "nofill bug" has been in netCDF-3 releases since at least 1999.

In summary: under some circumstances, writing data in nofill mode that crosses a disk block boundary more than one disk block beyond the end of the file can zero out previously written data that hasn't yet been flushed to disk.

The following conditions are necessary for the nofill bug to occur:

1. Writing data to a netCDF classic format file or 64-bit offset file using nofill mode (not the library default, but used in some utility software).

2. Writing data that crosses the boundary between one and two disk blocks beyond the last block in the file, as might happen when writing a multidimensional variable by slices in reverse order.

The above conditions are necessary, but not sufficient. Occurrence of the bug also depends on the amount of data, where the data is written in terms of disk block boundaries, and the current state of a memory buffer. These additional conditions make the bug unlikely, but more likely on file systems with large disk I/O block sizes. The bug was first reported on a high-performance file system that uses 2 MB disk blocks.

The result of the bug is a corrupt file, with data earlier in the file overwritten with zeros. The earlier data is overwritten with no indication that an error occurred, so the user may think the data is correctly stored.

We've verified that the bug exists in all previous versions of C-based netCDF-3 releases, but not in netCDF-Java. Perhaps it wasn't noticed until recently because all the systems on which we were testing have small disk block sizes (8 KB or less), and the bug is more likely with large disk blocks. Also, most of our tests don't write the last variable in a file backwards, which leaves file system "holes" when written in nofill mode.

Writing data in nofill mode requires a call such as one of the following for the C, Fortran-77, Fortran-90, and C++ APIs, respectively:

  nc_set_fill(ncid, NC_NOFILL, &old_fill_mode)
  nf_set_fill(ncid, NF_NOFILL, old_fill_mode)
  nf90_set_fill(ncid, NF90_NOFILL, old_fill_mode)
  file->set_fill(NcFile::NoFill)

More information about nofill mode is available here:

  http://www.unidata.ucar.edu/netcdf/docs/netcdf-c.html#nc_005fset_005ffill

Some widely used software, such as NCO, has used nofill mode as a default for better performance, so a user might not be aware that files are being written in nofill mode. A separate announcement of a new release of NCO that doesn't use nofill mode will appear soon.

Although the nofill bug is fixed in netCDF version 4.1.3, to which we recommend upgrading, other workarounds include:

- avoiding use of nofill mode
- enabling share mode (NC_SHARE)
- not writing large variables at the end of a file in reverse of the order in which their values are stored
- using netCDF-4

More details are available in our bug database, if you're interested:

  http://www.unidata.ucar.edu/jira/browse/NCF-22

A separate patch to netCDF 4.1.2 that fixes just this bug is also available here:

  http://www.unidata.ucar.edu/netcdf/patches/nofill-bug.patch

If you're interested in how to determine your disk I/O block size, see, for example:

  http://www.linfo.org/get_block_size.html
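For concreteness, here is a minimal C sketch of the kind of nofill, reverse-order write pattern described in conditions 1 and 2 above. This is my own illustration, not code from the bug report: the file name, dimension sizes, and variable are made up, error checking is omitted, and whether the bug actually fires for a pattern like this still depends on the disk block size and buffer state, as noted:

  #include <netcdf.h>

  #define NRECS 4
  #define NX    (1024 * 1024)    /* illustrative slice size: 4 MB of floats */

  int main(void)
  {
      int ncid, dimids[2], varid, old_fill_mode, rec;
      static float slice[NX];    /* zero-initialized; contents don't matter here */
      size_t start[2] = {0, 0}, count[2] = {1, NX};

      /* classic-format file; switch to nofill mode (condition 1) */
      nc_create("demo.nc", NC_CLOBBER, &ncid);
      nc_set_fill(ncid, NC_NOFILL, &old_fill_mode);

      nc_def_dim(ncid, "rec", NC_UNLIMITED, &dimids[0]);
      nc_def_dim(ncid, "x", NX, &dimids[1]);
      nc_def_var(ncid, "var", NC_FLOAT, 2, dimids, &varid);
      nc_enddef(ncid);

      /* write record slices in reverse order, so each write lands past
         the current end of file and leaves a "hole" behind it
         (condition 2) */
      for (rec = NRECS - 1; rec >= 0; rec--) {
          start[0] = rec;
          nc_put_vara_float(ncid, varid, start, count, slice);
      }

      nc_close(ncid);
      return 0;
  }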
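Of the workarounds above, enabling share mode is a small change to existing code: just add NC_SHARE to the mode flags when the file is created or opened. Again, an illustrative fragment rather than a recipe:

  /* NC_SHARE reduces the library's buffering so writes reach disk
     promptly, avoiding the stale-buffer condition the bug needs,
     at some cost in write performance */
  nc_create("demo.nc", NC_CLOBBER | NC_SHARE, &ncid);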
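On POSIX systems, one quick way to query the block size programmatically is via statvfs(3); a sketch, assuming f_bsize reports the file system's preferred I/O block size on your platform:

  #include <stdio.h>
  #include <sys/statvfs.h>

  int main(void)
  {
      struct statvfs vfs;
      if (statvfs(".", &vfs) == 0)   /* current directory's file system */
          printf("I/O block size: %lu bytes\n", (unsigned long) vfs.f_bsize);
      return 0;
  }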
--Russ