Re: [netcdfgroup] Diff between record and non-record data

To: Dave Allured - NOAA Affiliate <dave.allured@xxxxxxxx>
Subject: Re: [netcdfgroup] Diff between record and non-record data
From: Michael Powell <mwpowellhtx@xxxxxxxxx>
Date: Fri, 15 May 2015 13:39:43 -0400
On Mon, May 11, 2015 at 5:13 PM, Michael Powell <mwpowellhtx@xxxxxxxxx> wrote:
> On Mon, May 11, 2015 at 3:28 PM, Dave Allured - NOAA Affiliate
> <dave.allured@xxxxxxxx> wrote:
>> Michael,
>>
>> On Sun, May 10, 2015 at 2:35 PM, Michael Powell <mwpowellhtx@xxxxxxxxx>
>> wrote:
>>>
>>> It's been a few years since I've looked at netCDF at all, so this is a
>>> refresher. Also, this has no doubt been covered a million times, but I
>>> am confused about the nature of the data block.
>>
>>
>> You are inquiring specifically about netcdf-3 format (classic and 64-bit
>> offset), not netcdf-4 (HDF5), right?
>
> For now, yessir. Thanks for the follow up.
>
>>  Unless you have some very good reason for attempting to access netcdf-3
>> format directly, please use the netcdf libraries (C, fortran, etc.) to read
>> and write.  Rough knowledge of the low-level format is helpful for
>> understanding performance issues.
>
> It's good to know. Mostly morbid curiosity: I enjoy working with file
> formats like this, but I did take note that there are C/C++ and other
> APIs readily available. I like to push my architectural skills in
> areas like this.

Except for what I believe is a compiler/platform difference in how
readers/writers handle single- and double-precision floating-point
values, I believe I have a pure C++ model/reader/writer functioning
properly. I verified it against a couple of example NC files for
comparison.

>>> If I understand the header/data parts correctly (which, I may not),
>>> non-record can be 'infinite'?
>>
>>
>> Not infinite, rather about 8 exabytes.  But that is probably what you meant.
>
> Practically infinite, at any rate.
>
>> This is shown in a good summary table in the older netcdf documentation:
>> http://www.unidata.ucar.edu/software/netcdf/docs/netcdf/Large-File-Support.html
>>
>>>
>>> Rather, the non-record, fixed-length
>>> parts should occur prior to the record data?

There are probably some nuances about record versus non-record
metadata, data, and so on. Notwithstanding what I think are
compiler/platform read/write differences, I believe I've got it at
least presentable. I may post it on GitHub and open it up to feedback.
It's presently tested in VS2013 Update 4, and uses current includes
and, in places, C++11 features.

>> Yes.  I think the following overview is a little less confusing than in the
>> BNF spec.  Think of all netcdf-3 files as having five possible parts.  All
>> parts except the header are optional.  The two pads are a minor detail,
>> don't worry about them:
>>
>> * Header
>> * Header pad
>> * Fixed length variables
>> * Data pad
>> * Record data

I'm not sure what is meant by the so-called "header pad". Most fields that
call for padding are pretty well self-contained, i.e. all
name/text-oriented properties/values. Similarly for the "data pad": it is
pretty well self-contained, self-describing, and/or an inherent part of
the model (i.e. using std::vectors).
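
A minimal sketch of the per-field padding mentioned above, assuming the
classic-format rule that names and attribute values are rounded up to a
4-byte boundary. The function name padded_size is hypothetical and is
not part of any netCDF API; this does not cover the separate "header
pad" or "data pad" regions.

```cpp
#include <cassert>
#include <cstdint>

// Round a byte count up to the next multiple of 4.
// Assumption: this mirrors how netcdf-3 pads names and attribute values.
std::uint64_t padded_size(std::uint64_t nbytes) {
    assert(nbytes <= UINT64_MAX - 3);  // guard against wrap-around
    return (nbytes + 3u) & ~std::uint64_t{3};
}
```

For example, a 6-character name would occupy 8 bytes on disk, while an
8-character name would need no padding at all.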

>>> How does one tell what is record versus non-record?
>>
>>
>> Many applications don't bother to look.  Basic read access works the same
>> either way, except for performance issues with large files.
>>
>> If you care, the manual method is to look for a dimension labeled
>> "unlimited" in ncdump -h output.  Any variables using this dimension are
>> record variables.  All others are fixed length variables.
>>
>> There are inquiry functions which can do this under program control.
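
When parsing the header by hand instead, the same test can be sketched
as below. This assumes the classic-format convention that the record
dimension is stored with length 0 and that a record variable uses it as
its first dimension. The Dim and Var structs and both function names
are hypothetical, for illustration only. With the C library, the
equivalent inquiry would presumably go through nc_inq_unlimdim and
nc_inq_vardimid.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical in-memory model of a parsed netcdf-3 header.
struct Dim { std::string name; std::uint32_t length; };  // length 0 = record dimension
struct Var { std::string name; std::vector<int> dimids; };

// Returns the index of the record dimension, or -1 if the file has none.
int find_record_dim(const std::vector<Dim>& dims) {
    for (std::size_t i = 0; i < dims.size(); ++i)
        if (dims[i].length == 0) return static_cast<int>(i);
    return -1;
}

// A variable is a record variable when its first dimension is the record dimension.
// Scalars (no dimensions) are always fixed-length.
bool is_record_var(const Var& v, int recdim) {
    assert(recdim >= -1);
    return recdim >= 0 && !v.dimids.empty() && v.dimids.front() == recdim;
}
```

With dimensions time (unlimited), lat and lon, a variable declared as
precip(time, lat, lon) would be classified as a record variable and
lat(lat) as a fixed-length one.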
>>
>>>
>>> I think this is as a function of the var nelems, dimid, etc, but it's
>>> not especially clear from the BNF grammar, or associated verbiage. For
>>> example, what is meant by 'interleaved'? As a cross section of the
>>> specified variables?
>>
>>
>> Only record variables are interleaved.  The sub-arrays ("cross sections")
>> for record subscript #0 of all record variables are combined together at the
>> start of the variable length section.  The combined block for subscript #0
>> is often referred to casually as one "record".  The combined block for
>> subscript #1 is the next record.  And so on.
>>
>> Suppose you have precip(time, lat, lon) and time(time).
>> Then the first record is precip(0,*,*) and time(0) together, and so on.
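
The interleaving described above can be sketched as simple offset
arithmetic. This assumes each record variable carries a per-record size
(vsize) and the file offset of its record-0 data (begin), as in the
classic format header. The RecVar struct, the function names, and the
offsets used in the example are hypothetical. It also ignores the
special case of a single record variable of a small type, which as far
as I understand is stored without padding.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical per-variable header info for record variables only.
struct RecVar {
    std::string name;
    std::uint64_t vsize;  // bytes this variable occupies in ONE record
    std::uint64_t begin;  // file offset of this variable's data in record 0
};

// Size of one full record: the sum of vsize over all record variables.
std::uint64_t record_size(const std::vector<RecVar>& vars) {
    std::uint64_t total = 0;
    for (const RecVar& v : vars) total += v.vsize;
    return total;
}

// File offset of variable v's slab for record index rec.
std::uint64_t slab_offset(const RecVar& v, std::uint64_t rec, std::uint64_t recsize) {
    assert(recsize >= v.vsize);  // a record contains at least this variable
    return v.begin + rec * recsize;
}
```

Taking precip as floats on a 2 x 3 grid (24 bytes per record) and time
as a double (8 bytes), with the record section assumed to start at
offset 200, each record is 32 bytes: precip(1,*,*) would sit at offset
232 and time(1) at offset 256.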
>>
>>> Also taking into account spiked variable array dimensions?
>>
>>
>> What do you mean by "spiked"?
>
> Jagged arrays: something like this (turned 90 degrees):
>
> ++
> ++++
> +
> ++++++
> ++++++
> ++
>
> And so on.
>
>>> I'm sure I'm barely scratching the surface here... Insights are welcome.
>>>
>>> Thank you...
>>>
>>> Regards,
>>>
>>> Michael
>>
>>
>> --Dave

Thanks again...