
Re: [netcdfgroup] (known) failure in ncdump with Sun compilers + additional test case for `make check`

  • To: Thomas Orgis <thomas.orgis@xxxxxx>
  • Subject: Re: [netcdfgroup] (known) failure in ncdump with Sun compilers + additional test case for `make check`
  • From: Russ Rew <russ@xxxxxxxxxxxxxxxx>
  • Date: Tue, 21 Jul 2009 10:40:05 -0600
Hi Thomas,

You wrote:
> OK, some update on that one: I applied the workaround of compiling
> dumplib.o with -O0.  This makes `make check` (OK, in my case, `gmake
> check` ... ) succeed, but the resulting ncdump is still broken.
> 
> Again, two points:
> 1. I suggest adding another test case, with the cdl file I am about to
> paste. 

Thanks for this new test.  As it's apparently stricter than the ncdump
tests we have, we'll add it.

> 2. I again would like to know if someone reported this to Sun. This
> miscompilation is really a serious issue and should be addressed. I
> will report it myself if there is no one giving notice... 

The user who reported and helped investigate this problem in early
February also committed to reporting the bug to Sun.  You can read about
my unsuccessful attempts to isolate the bug to a smaller program than
ncdump or to find a workaround that would not trigger the bug here:

  http://www.unidata.ucar.edu/support/help/MailArchives/netcdf/msg05358.html

The details of the bug, as reported by that user, are:

  here's the solution with the Sun compiler:

  ncdump/dumplib.c must be compiled using -O0 explicitly, otherwise
  -O2 is used by default. By hand, just remove dumplib.o, add -O0
  to CFLAGS in the Makefile (second occurrence), and run gmake. The
  dependent programs are recompiled and the tests succeed.

  This seems to be an optimizer bug. I've checked the generated code:
  in the complicated version it does not set the xmm0 register, which
  breaks the calling ABI for libc, whereas your simple code below
  shows that it is set as expected. The value printed is just
  a random value left in xmm0 by some earlier use. I halted the code
  before entering snprintf, set xmm0 explicitly to the value, continued,
  and, voila, the value printed is correct!

  The instructions generated are TOTALLY different, a symptom I've seen
  very often: just adding a line somewhere completely changes the
  generated code, which makes it really hard to track down such errors.

  I'll report this to Sun, maybe they've a better clue why this happens.
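
The call pattern described in that report can be illustrated with a
minimal C sketch. This is not the code from dumplib.c: the function
names are hypothetical and the "%.15g" format is an assumption. A double
passed to a variadic function such as snprintf travels in xmm0 under
the x86-64 System V calling convention, so a caller that leaves xmm0
unset prints whatever the register last held. Checking that the printed
text parses back to the original value is one way to notice this.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Format one double as text. The "%.15g" format is an assumption made
   for this sketch, not taken from the ncdump sources. */
int format_double(char *buf, size_t len, double value)
{
    assert(buf != NULL && len > 0);
    /* 'value' is a variadic floating-point argument here; on x86-64
       System V the caller must place it in xmm0 before the call. */
    return snprintf(buf, len, "%.15g", value);
}

/* Return 1 if the printed text parses back to (nearly) the same finite
   value, 0 otherwise. */
int double_round_trips(double value)
{
    char buf[64];
    int n = format_double(buf, sizeof buf, value);
    if (n < 0 || (size_t)n >= sizeof buf)
        return 0;                      /* formatting failed or truncated */

    double parsed = strtod(buf, NULL);
    if (value == 0.0)
        return parsed == 0.0;

    /* 15 significant digits need not reproduce every bit, so compare
       with a small relative tolerance instead of exact equality. */
    double diff = parsed - value;
    if (diff < 0)
        diff = -diff;
    double mag = value < 0 ? -value : value;
    return diff <= 1e-14 * mag;
}
```

With a correct compiler, double_round_trips(0.1) returns 1. A build
affected by the problem described above would be expected to fail such
a check, although the bug may not show up in a program this small.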

I've been unable to confirm that it was successfully logged as a Sun
compiler bug.  If we can't find it after a little more searching, we'll
report it again.

> Down to the mode of failure. I generate a test NetCDF file from this
> CDL:
> 
> netcdf bubble {
> dimensions:
>         element = 1000 ;
>         variable = 1 ;
>         base = 1 ;
>         time = UNLIMITED ; // (0 currently)
> variables:
>         double time(time) ;
>         double coefficient(time, element, variable, base) ;
> 
> // global attributes:
>                 :info = "Model state for the AWI DG model, ThOr breed." ;
>                 :par_stringsize = 30 ;
>                 :par_base_grades = 0, 0, 0 ;
>                 :par_grid_elements = 10, 10, 10 ;
>                 :par_hill_params = 0.01, 0.1, 0.1, 0.1 ;
>                 :par_linad_speed = 1., 1., 1. ;
>                 :par_oro_types = "null                          null          
>                 null" ;
>                 :par_shallow_gravity = 1. ;
>                 :par_sys_name = "linear advection" ;
>                 :par_timeint_rksteps = 1 ;
>                 :par_timeint_step = 0.1 ;
>                 :par_trans_gradients = 2., 2., 2. ;
>                 :par_trans_types = "linear                        linear      
>                   linear" ;
>                 :par_world_dims = 3 ;
>                 :par_world_lengths = 10., 10., 10. ;
> data:
> }
> 
> 
> shell$ ncgen -o bubble.nc bubble.cdl
> 
> Now I have a look at it with ncdump compiled with CFLAGS=-m64 overall,
> but dumplib.o being built with CFLAGS='-O0 -m64' instead:
> 
> shell$ ncdump bubble.nc 
> netcdf bubble {
> dimensions:
>         element = 1000 ;
>         variable = 1 ;
>         base = 1 ;
>         time = UNLIMITED ; // (0 currently)
> variables:
>         double time(time) ;
>         double coefficient(time, element, variable, base) ;
> 
> // global attributes:
>                 :info = "Model state for the AWI DG model, ThOr breed." ;
>                 :par_stringsize = 30 ;
>                 :par_base_grades = 0, 0, 0 ;
>                 :par_grid_elements = 10, 10, 10 ;
>                 :par_hill_params = 2.22044604925031e-16, 0.999999992549419, 0.999999992549419, 0.999999992549419 ;
>                 :par_linad_speed = 0.999999992549419, 0.999999992549419, 0.999999992549419 ;
>                 :par_oro_types = "null                          null          
>                 null" ;
>                 :par_shallow_gravity = 0.999999992549419 ;
>                 :par_sys_name = "linear advection" ;
>                 :par_timeint_rksteps = 1 ;
>                 :par_timeint_step = 0.999999992549419 ;
>                 :par_trans_gradients = 0.999999992549419, 0.999999992549419, 0.999999992549419 ;
>                 :par_trans_types = "linear                        linear      
>                   linear" ;
>                 :par_world_dims = 3 ;
>                 :par_world_lengths = 0.999999992549419, 0.999999992549419, 0.999999992549419 ;
> data:
> }
> 
> That looks grossly wrong. Rebuilding everything inside the ncdump/
> directory with CFLAGS="-O0 -m64" results in a working ncdump binary
> whose output is identical to the input CDL file.  This is also
> disturbing because it raises the question of whether my application
> will be affected by the same bug that harasses ncdump when building
> with Sun Studio. Did really no one investigate the mode of breakage
> and why it apparently(?!) does not affect other parts of NetCDF?

Yes, we investigated to the point of determining that it was a compiler
bug when compiling with -m64 for a 64-bit environment, and we tried
unsuccessfully to find a workaround other than using -O0 when compiling.
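
A stricter check along the lines of the test case above would compare
the numeric attribute values in the ncdump output against the values in
the input CDL. The following is a minimal sketch in C, not part of the
netCDF test suite: the function names are hypothetical, and it assumes
each numeric attribute sits on a single unwrapped line of the form
":name = v1, v2, ... ;".

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Parse the comma-separated numbers after '=' in one CDL attribute
   line. Returns how many values were stored in out (at most max).
   String attributes yield 0 because no number follows the '='. */
size_t parse_cdl_doubles(const char *line, double *out, size_t max)
{
    assert(line != NULL && out != NULL);
    const char *p = strchr(line, '=');
    size_t n = 0;
    if (p == NULL)
        return 0;
    p++;                                /* step past '=' */
    while (n < max) {
        char *end;
        double v = strtod(p, &end);     /* skips leading whitespace */
        if (end == p)
            break;                      /* no further number, e.g. at ';' */
        out[n++] = v;
        p = end;
        while (*p == ' ' || *p == '\t' || *p == ',')
            p++;                        /* skip separators */
    }
    return n;
}

/* Return 1 if the line holds exactly the expected values (up to a small
   relative tolerance), 0 otherwise. Supports at most 15 values. */
int cdl_attr_matches(const char *line, const double *expected, size_t count)
{
    double got[16];
    if (count > 15)
        return 0;
    if (parse_cdl_doubles(line, got, 16) != count)
        return 0;
    for (size_t i = 0; i < count; i++) {
        double diff = got[i] - expected[i];
        if (diff < 0)
            diff = -diff;
        double mag = expected[i] < 0 ? -expected[i] : expected[i];
        if (diff > 1e-12 * mag)
            return 0;
    }
    return 1;
}
```

Given the expected values 0.01, 0.1, 0.1, 0.1 for par_hill_params, the
line from the input CDL matches, while the corresponding line from the
broken ncdump output does not.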

> So... shall one start crying at Sun to fix their compiler on
> Solaris/x86-64 with NetCDF or is there some hidden wisdom already that
> I am not aware of?

It would probably help if you could also report this bug.

--Russ