Re: ncdigest V1 #587 (standard handling of scale/offset and missing data)

To: "Davies, Harvey" <harvey.davies@xxxxxxxxxxxx>
Subject: Re: ncdigest V1 #587 (standard handling of scale/offset and missing data)
From: John Caron <caron@xxxxxxxxxxxxxxxx>
Date: Mon, 23 Apr 2001 13:47:05 -0600

The User's Guide states it is illegal to have valid_range if either
valid_min or valid_max is defined.  If such a file exists in practice,
I consider it better to force the user to delete attributes to avoid
such ambiguity.
I guess the problem is that there's no library enforcement of such Conventions, and so I am inclined to relax the rules if it doesn't cause confusion.


I believe it would cause confusion.  I would prefer to provide an easy,
efficient way of deleting the redundant
attributes.


In my proposal I would ignore the valid_min and valid_max if valid_range
is present, which is arguably equivalent to deleting in this context.
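The precedence rule proposed here could be sketched roughly as follows. This is a minimal Python illustration, not code from any netCDF library; the function name `resolve_valid_range` and the plain-dict representation of a variable's attributes are assumptions made for the example.

```python
import math

def resolve_valid_range(attrs):
    """Return (lo, hi) for a variable; valid_range wins if it is present.

    `attrs` is a hypothetical dict of attribute name -> value.
    """
    if "valid_range" in attrs:
        # valid_min / valid_max are ignored, as if they had been deleted.
        lo, hi = attrs["valid_range"]
        return lo, hi
    # Otherwise use whichever one-sided bounds exist; unbounded if absent.
    lo = attrs.get("valid_min", -math.inf)
    hi = attrs.get("valid_max", math.inf)
    return lo, hi
```

With this sketch, a file that carries both valid_range and valid_min reads the same as one where valid_min had been removed.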

2) a missing_value attribute may also specify a scalar or vector of missing values.
Yes, but note that this attribute is merely a hint for output & should
be ignored on input.
I don't understand why you ignore it on input.


We wanted to keep things simple and reasonably efficient.  The valid range
is defined by valid_min, valid_max, valid_range and _FillValue.  The test
for missing involves zero, one or two comparisons.  I would not like to
have to do more than two comparisons.  Even two is quite time consuming.
We could have chosen to use missing_value if none of the above four
attributes were defined, but we decided against this (just for simplicity,
if I remember correctly).
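The "zero, one or two comparisons" test described above might look something like this. It is only a sketch under stated assumptions: `make_is_missing` is a hypothetical helper, the bounds `lo` and `hi` are taken as already resolved from the attributes, and how _FillValue contributes a bound is deliberately left out.

```python
def make_is_missing(lo=None, hi=None):
    """Build a missing-data test that costs zero, one or two comparisons.

    `lo` and `hi` are the already-resolved valid bounds (None = unbounded).
    """
    if lo is None and hi is None:
        return lambda x: False            # zero comparisons: every value is valid
    if hi is None:
        return lambda x: x < lo           # one comparison
    if lo is None:
        return lambda x: x > hi           # one comparison
    return lambda x: x < lo or x > hi     # at most two comparisons
```

Choosing the test once per variable keeps the per-value cost down to the comparisons themselves.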

What if there is no valid_range specified?


As suggested above, we could have chosen to use missing_value like
_FillValue when none of these
four attributes was defined, but missing_value was defined. I would prefer
to force renaming of
missing_value to _FillValue, but I'm prepared to admit this may be
unreasonably harsh.

What if the missing_data is inside the valid_range?


I assume you mean missing_value.  There is no problem if missing_value is
merely a hint for output.
You simply always ignore it (on input at least)!

I am reading an existing file; there's no option of deleting or forcing renaming. I want to do as good a job as possible of extracting meaning out of the file. Someone has taken the time to add a missing_value attribute to the file. It doesn't seem reasonable to ignore it, because if they are following the UG conventions, it won't be there, so no problem. The fact that it exists means they aren't following the conventions (or have made a mistake), so you don't know what logic they are following. In that case, I would fall back on what a reasonable person would intend (who doesn't know about the conventions).
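As an illustration of that fallback, a lenient reader could honor missing_value (scalar or vector) in addition to the valid range, so that a missing_value lying inside the valid range is still reported as missing. The sketch below is hypothetical: `is_missing_lenient` and the dict of attributes are names invented for the example, and _FillValue handling is omitted.

```python
def is_missing_lenient(value, attrs):
    """Report a value as missing if missing_value lists it or it is out of range.

    `attrs` is a hypothetical dict of attribute name -> value.
    """
    missing = attrs.get("missing_value", ())
    if not isinstance(missing, (list, tuple)):
        missing = (missing,)          # a scalar becomes a one-element vector
    if value in missing:              # compares by equality; not a reliable NaN test
        return True
    if "valid_range" in attrs:
        lo, hi = attrs["valid_range"]
    else:
        lo, hi = attrs.get("valid_min"), attrs.get("valid_max")
    if lo is not None and value < lo:
        return True
    return hi is not None and value > hi
```

The extra membership test is exactly the cost the stricter reading avoids, which is the trade-off under discussion.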

3) if there is no missing_value attribute, the _FillValue attribute can be used to specify a scalar missing value.
For what purpose?  This could be reasonable on input if you are defining
an internal missing value, but my understanding of your proposal is that
you are simply defining an array of data.
I'm not sure if I understand. Through the hasMissing() and isMissing() methods I am providing a service of knowing when the data is missing/invalid.
I am thinking of an application which has an internal missing value for each
variable.  In this case the decision
on whether data is missing is not part of the input process, but done later.
I gather this is not the case with
your proposed routines.
OK, I understand _FillValue better, thanks. Two things though: 1) it seems reasonable to pre-fill an array with valid values, since perhaps only a few data points need to be written that way.
I agree there may be cases where you want to pre-fill with a valid value of
say 0.  The UG states this is legal even though against recommended
practice.  We should have worded this more clearly to show that it is fine.

I don't understand. If _FillValue can be legal, how can you construct a valid_range with it, using it as the valid_min or valid_max? It seems like contradictory uses.

The above rules would seem to preclude this. 2) Is the default fill value supposed to operate the same way? If not, it seems funny that they might have radically different meanings.
If none of the four above attributes is defined then all values are valid.
(Well not quite, I guess NaN can
hardly ever be considered 'valid'!! -- Incidentally I feel we should rethink
the recommendation not to use
NaN and other IEEE special values now that the IEEE standard is so widely
used.  I use NaN a lot.)

I agree NaNs are good missing value indicators, and I will add this to my spec.
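One practical point when adding NaN to such a spec: under IEEE 754 every ordered comparison with NaN is false, and NaN is not equal to itself, so neither a range test nor an equality test catches it. The short Python sketch below shows this; the helper `is_missing` is a hypothetical name used only for illustration.

```python
import math

nan = float("nan")
lo, hi = 0.0, 100.0

# Every ordered comparison with NaN is False, so a pure range test
# does not flag NaN as out of range.
out_of_range = nan < lo or nan > hi

# NaN is not equal to itself, so an equality test against a
# NaN-valued attribute fails as well.
equals_itself = nan == nan

def is_missing(x, lo, hi):
    """Range test with an explicit NaN check added (hypothetical helper)."""
    return math.isnan(x) or x < lo or x > hi
```

So treating NaN as missing needs an explicit check in addition to the two range comparisons.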

Implementation rules for missing data with scale/offset:
   1) valid_range is always in the units of the converted (unpacked)
data.
NO!!! See above.
The problem is that many important datasets use the internal units. I think there's a good argument that it is more natural since those would be the units a human would think in. Is there anything in the current manual that specifies this? I just reread it again and I don't see it.
I must apologise for this omission.  Despite this omission, the convention
has always been that valid
range is external.  It may well have been more logical for it to be
internal, but it is too late to change it.
You could argue for it to be internal if the datatype matched the internal
type (i.e. that of scale_factor
and add_offset), but I think this would cause confusion.

To let you off the hook a bit, Brian Eaton and I decided that the UG wording "The type of each valid_range, valid_min and valid_max attribute should match the type of its variable" mostly implies that valid_range is external.

I'm thinking that if the scale_factor type matches the valid_range type, then consider it internal. I think this would solve most known datasets. Anyone on the list have counterexamples?
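A rough sketch of that heuristic, in Python and purely for illustration: the function names and the use of plain type-name strings are assumptions, not part of any library. It relies only on the usual unpacking relation, unpacked = packed * scale_factor + add_offset.

```python
def valid_range_is_internal(valid_range_type, scale_factor_type):
    """Heuristic: same type as scale_factor means the range is in unpacked units."""
    return valid_range_type == scale_factor_type

def unpacked_valid_range(valid_range, valid_range_type,
                         scale_factor, scale_factor_type, add_offset=0.0):
    """Return the valid range expressed in unpacked (internal) units."""
    lo, hi = valid_range
    if valid_range_is_internal(valid_range_type, scale_factor_type):
        return lo, hi                     # already in unpacked units
    # Otherwise the range is in packed (external) units: convert both ends.
    a = lo * scale_factor + add_offset
    b = hi * scale_factor + add_offset
    return (a, b) if a <= b else (b, a)   # a negative scale_factor flips the order
```

For example, a short variable with a float scale_factor of 0.5, an add_offset of 10.0 and a short valid_range of (0, 200) would be read as valid from 10.0 to 110.0 in unpacked units.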

I think we should extend the UG wording to allow either way. But at the moment, I am not proposing that, just trying to read existing datasets.


I appreciate the work you've done on this, Harvey.

John.

References:
- RE: ncdigest V1 #587 (standard handling of scale/offset and missi
  - From: Davies, Harvey