All,

I was wondering if anyone out there has encountered issues with NetCDF 4.3.2 and Intel 15.0.0.090 (just released), because I seem to have hit one in our application, where we throw an FPE while writing a file.
To wit, I work on the GEOS-5 GCM, and our Baselibs build things like HDF4, HDF5, netCDF, etc., for use with our code. Our normal build of netcdf in the Baselibs is usually just a simple one, configured as:
netcdf.config : netcdf/configure
        @echo "Configuring netcdf $*"
        @(cd netcdf; \
           export PATH="$(prefix)/bin:$(PATH)" ;\
           export CPPFLAGS="$(CPPFLAGS) $(INC_SUPP)";\
           export LIBS="-L$(prefix)/lib -lmfhdf -ldf -lsz -ljpeg $(LINK_GPFS) $(LIB_CURL) -lm" ;\
           ./configure --prefix=$(prefix) \
              --includedir=$(prefix)/include/netcdf \
              --enable-hdf4 \
              --enable-dap \
              $(NC_PAR_TESTS) \
              --disable-shared \
              --enable-netcdf-4 \
              CC=$(NC_CC) FC=$(NC_FC) CXX=$(NC_CXX) F77=$(NC_F77) )
In this case, since we build for parallel HDF5, that means CC=mpicc, FC=mpif90, etc. I built two versions, both with Intel 15.0.0.090, one using MVAPICH2 2.0 and one using Intel MPI 5.0.1.035, and both show the issue.
I did a "make check" with my two netcdf builds and they both passed most of the tests (some dap tests fail, I think, because I'm on a compute node where no outside internet is seen) so it must not be a simple fail.
So my first thought was: let's add '-g -O0', rebuild the library, and get to the bottom of this. And, of course, the code then runs just fine! So my guess is that it has something to do with the optimizer.
Then I built the library explicitly with "-g -O" and got the same FPE as before, so it seems the optimizer has done...something. Totalview shows that when we go to write an output NC4 file we get an FPE, and the stack trace leads to var_create_dataset[1]:
var_create_dataset,       FP=7fff42867cb0
write_var,                FP=7fff42867d70
nc4_rec_write_metadata,   FP=7fff42867de0
nc4_enddef_netcdf4_file,  FP=7fff42867e00
NC4__enddef,              FP=7fff42867e20
nc_enddef,                FP=7fff42867e40
ncendef,                  FP=7fff42867e50
ncendf_,                  FP=7fff42867e60
cfio_create_,             FP=7fff4286a900
esmf_cfiosdffilecreate,   FP=7fff4286b470
esmf_cfiofilecreate,      FP=7fff4286b4c0
and points to lines 1453-1454 of libsrc4/nc4hdf.c:
1449          /* Unlimited dim always gets chunksize of 1. */
1450          if (dim->unlimited)
1451             chunksize[d] = 1;
1452          else
1453             chunksize[d] = pow((double)DEFAULT_CHUNK_SIZE/type_size,
1454                                1/(double)(var->ndims - unlimdim));
1455
In Totalview, I see that "type_size" is 0, which of course will do bad things in that division and is probably what triggers the FPE. Since type_size is determined from fields within var, who knows whether a struct got clobbered or what.
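To make that concrete, here's a tiny standalone sketch of the failing expression (not NetCDF's actual code path, and untested against our Intel flags; feenableexcept is just the glibc way to get the same trapping behavior, and 4194304 is what I believe DEFAULT_CHUNK_SIZE is, so double-check nc4internal.h):

#define _GNU_SOURCE               /* for feenableexcept (glibc) */
#include <fenv.h>
#include <math.h>
#include <stdio.h>

#define DEFAULT_CHUNK_SIZE 4194304   /* assumed value */

int main(void)
{
    size_t type_size = 0;            /* the bad value seen in Totalview */
    int ndims = 3, unlimdim = 1;     /* made-up dims for illustration */

    /* Trap divide-by-zero the way an FPE-trapping build would. */
    feenableexcept(FE_DIVBYZERO | FE_INVALID);

    /* The expression from nc4hdf.c lines 1453-4: with type_size == 0,
       the division raises SIGFPE before pow() is even called. */
    double chunksize = pow((double)DEFAULT_CHUNK_SIZE / type_size,
                           1.0 / (double)(ndims - unlimdim));

    printf("chunksize = %f\n", chunksize);  /* not reached when trapping */
    return 0;
}

A guard like "if (type_size == 0)" ahead of that division would at least fail loudly, but the real question is why type_size is 0 in the first place.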
Has anyone else seen this? For now I suppose I can just point to the debug netcdf build so I can continue developing/testing with Intel 15, though I don't know what the cost of running netCDF at -O0 is.
Thanks,
Matt

[1] Yes, that does indeed say ncendf, because this code has been around a while in our model and no one has wanted to translate all the ancient netcdf calls into actual modern ones for fear of breaking something crucial. In the end it still reaches the right call, and the translation would be mechanical, something like the sketch below.
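(Illustrative only and untested; the leave_define_mode_* names are made up, but ncendef and nc_enddef are the real v2/v3 entry points:)

#include <stdio.h>
#include <netcdf.h>

/* The ancient netCDF-2 call our Fortran wrapper (ncendf_) ends up in: */
void leave_define_mode_v2(int ncid)
{
    if (ncendef(ncid) == -1)             /* v2 API: -1 on error */
        fprintf(stderr, "ncendef failed\n");
}

/* The modern equivalent, with a proper error string: */
void leave_define_mode_v3(int ncid)
{
    int status = nc_enddef(ncid);
    if (status != NC_NOERR)
        fprintf(stderr, "nc_enddef: %s\n", nc_strerror(status));
}

Either way it lands in the same enddef machinery, which is why the stack trace above goes ncendf_ -> ncendef -> nc_enddef.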
--
Matt Thompson
SSAI, Sr Software Test Engr
NASA GSFC, Global Modeling and Assimilation Office
Code 610.1, 8800 Greenbelt Rd, Greenbelt, MD 20771
Phone: 301-614-6712
Fax: 301-614-6246