Re: [netcdfgroup] storing sparse matrices data in NetCDF

  • Subject: Re: [netcdfgroup] storing sparse matrices data in NetCDF
  • From: Sourish Basu <Sourish.Basu@xxxxxxxxxxxx>
  • Date: Mon, 18 Mar 2019 19:58:40 -0600
Ken,

As someone who works with netcdf every day, I wouldn't want you to grid
your data if I were a user. Your data are basically 20,000 different
time series at 20,000 fixed lat/lon/elev locations. It's not inherently
on a grid, and there's no advantage to be had in gridding it into a
mostly empty hypercube. I think netcdf is very well suited for these
sorts of data as well, and in my work we exchange atmospheric
observation data -- also time series at discrete locations and times --
as netcdf files all the time. Given the simple data structure you've
gone for, it's easy enough for a user to write their own subsetting
script should they want to.
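
A concrete sketch of that layout (hedged: this uses xarray, and the names
and sizes are illustrative rather than taken from Ken's files): one data
variable over (time, station), with lat/lon as per-station variables
instead of grid dimensions.

import numpy as np
import xarray as xr

# One time series per fixed station: runoff(time, station). lat and lon
# live on the station dimension; nothing is gridded.
n_time, n_station = 365, 20000
ds = xr.Dataset(
    {'runoff': (('time', 'station'), np.zeros((n_time, n_station)))},
    coords={'time': np.arange(n_time),
            'lat': ('station', np.zeros(n_station)),
            'lon': ('station', np.zeros(n_station))})
ds.attrs['featureType'] = 'timeSeries'  # CF convention hint for station data
ds.to_netcdf('stations.nc')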

It's true that you see a lot of discussion in this forum about handling
gridded netcdf data with NCO and viewing them with Panoply/ncview, but I
think that's just because netcdf tends to get used heavily in the world
of modelers -- be it land surface, atmosphere, or ocean -- and models
typically output gridded fields. I personally wish more people with data
like yours would consider netcdf, so I'm really happy that you are :-)

Cheers,

Sourish

On 3/18/19 5:25 PM, Ken Mankoff wrote:
> I could always grid the data - I have 20,000 outlets around the coast of 
> Greenland at 90 m resolution. The total number of cells is 500 million. This 
> seems inefficient, but the 499,980,000 empty cells do compress quite well. It 
> is ~15 MB/day (5 GB/year). Maybe compressing multiple days improves on this? 
> 5 GB/year is tractable for 40 years of data.
>
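> A hedged sketch of that multi-day idea (file, variable, and dimension
> names are illustrative): xarray can write per-variable deflate settings,
> chunked along time, so the long runs of empty cells compress together
> across days.
>
> import xarray as xr
>
> # Combine daily files and write one compressed, time-chunked file.
> ds = xr.open_mfdataset('runoff_day_*.nc', combine='by_coords')
> encoding = {'runoff': {'zlib': True, 'complevel': 4,
>                        'chunksizes': (30, 256, 256)}}  # (time, y, x)
> ds.to_netcdf('runoff_multiday.nc', encoding=encoding)
>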
> But as someone who works with NetCDF, wouldn't this data presentation make 
> you cringe? On the other hand, NetCDF works best when gridded, right?
>
>   -k.
>   
>
> On 2019-03-18 at 16:19 -0700, Gus Correa <gus@xxxxxxxxxxxxxxxxx> wrote...
>> Hi Ken
>>
>> That is true.
>> I suppose both CDO and NCO (ncks) assume the lat and lon are
>> monotonic (increasing or decreasing) coordinate variables, and that
>> runoff has (time,lat,lon) dimensions, not (time,ID).
>> ID is not a coordinate; it is just a label for your observation stations,
>> I guess.
>>
>> You could devise a more elaborate scheme: define lat and lon dimensions,
>> then lat(lat) and lon(lon) coordinate variables, and from there create a
>> 3D runoff(time,lat,lon) variable.
>> There are several hurdles though:
>> 1) The values of lat and lon in your CSV file may have repetitions (this
>> affects the actual length of each dimension, which may be <20000).
>> 2) The values of lat and lon in your CSV file may not be monotonically
>> ordered (either in increasing or decreasing order).
>> I didn't spot any repetitions in the sample file you sent (though the
>> full file may have some), but the lat and lon are definitely not
>> monotonically ordered; they can go up, then down, then up again.
>> Bona fide coordinate variables must be monotonic.
>> 3) Even if you weed out repetitions in lat or lon and sort them in
>> increasing or decreasing order, you would also have to reorder the
>> corresponding runoff values so that they continue to belong to the
>> correct station/location/ID, i.e. sort the whole file with (lat,lon) as
>> primary and secondary keys.
>>
>> Maybe Python has a sort routine that does all that for you gracefully, some
>> variant of qsort perhaps.
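>>
>> For example, with NumPy (a sketch; the array names are illustrative and
>> assume lat, lon, and runoff were loaded from the CSV). np.lexsort treats
>> its last key as the primary one, so passing (lon, lat) sorts by lat
>> first and lon second:
>>
>> import numpy as np
>>
>> order = np.lexsort((lon, lat))    # lat primary key, lon secondary
>> lat, lon = lat[order], lon[order]
>> runoff = runoff[:, order]         # keep runoff(time, ID) aligned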
>>
>> Gus
>>
>>
>> On Mon, Mar 18, 2019 at 6:50 PM Ken Mankoff <mankoff@xxxxxxxxx> wrote:
>>
>>> Hi Sourish, Gus, and Elizabeth,
>>>
>>> Thank you all for your suggestions. I think I've found something that
>>> works, except for one issue. Please excuse my terminology - being new
>>> to NetCDF creation I may get some details wrong, but I hope the data
>>> dump below speaks for itself.
>>>
>>> Because my data are 2D (time, ID), those are the dimensions, and
>>> lon,lat,x,y become variables on the ID dimension. This means my standard
>>> netcdf tools for slicing on spatial dimensions don't work. For example,
>>>
>>> cdo sellonlatbox,83.5,85,-27,-28 ds.nc bar.nc
>>>
>>> or
>>>
>>> ncks -d lat,83.5,85 -d lon,-27,-28 ds.nc bar.nc
>>> # ncks: ERROR dimension lat is not in input file
>>>
>>> Is there a way to make the data 2D but have the 2nd dimension be
>>> (lon,lat)? Even if yes, I don't imagine the cdo and ncks tools would work
>>> on that dimension... Is there a cdo, nco, or ncks (or other) simple tool
>>> I'm missing that can work with this non-gridded data the way those
>>> tools so easily work with gridded data?
>>>
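>>> One workaround sketched with xarray (hedged; the box bounds are
>>> illustrative): since lat and lon are variables on the ID dimension,
>>> spatial subsetting becomes a boolean mask over ID. Note that this file
>>> stores lat/lon as strings, so they need a cast to float first.
>>>
>>> import xarray as xr
>>>
>>> ds = xr.open_dataset('ds.nc')
>>> lat = ds['lat'].astype(float)
>>> lon = ds['lon'].astype(float)
>>> keep = (lat >= 83.5) & (lat <= 85) & (lon >= -28.2) & (lon <= -27)
>>> sub = ds.isel(ID=keep.values)  # stations inside the lon/lat box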
>>>
>>> Anyway, here is the Python xarray code I got working to produce the NetCDF
>>> file, reading in the 'foo.csv' from my previous email and generating ds.nc.
>>> Once I understood the NetCDF structure from the file Sourish provided, I
>>> was able to generate something similar using a higher level API - one that
>>> takes care of time units, calendar, etc. I leave out (x,y,elev) for brevity.
>>>
>>>
>>>   -k.
>>>
>>> import pandas as pd
>>> import xarray as xr
>>>
>>> # Read the CSV; the six header rows carry per-station metadata
>>> # (ID, lon, lat, ...) as a MultiIndex on the columns.
>>> df = pd.read_csv('foo.csv', index_col=0, header=[0,1,2,3,4,5])
>>> df.index = pd.to_datetime(df.index)
>>>
>>> # Build the dataset: time and ID are the dimensions; lon and lat become
>>> # variables on the ID dimension. get_level_values() returns the header
>>> # strings, which is why they show up as string variables in the dump.
>>> ds = xr.Dataset()
>>> ds['lon'] = (('ID'), df.columns.get_level_values('lon'))
>>> ds['lat'] = (('ID'), df.columns.get_level_values('lat'))
>>> ds['runoff'] = (('time', 'ID'), df.values)
>>> ds['ID'] = df.columns.get_level_values('ID')
>>> ds['time'] = df.index
>>>
>>> # Add metadata
>>> ds['lon'].attrs['units'] = 'Degrees East'
>>> ds['lon'].attrs['long_name'] = 'Longitude'
>>> ds['lat'].attrs['units'] = 'Degrees North'
>>> ds['lat'].attrs['long_name'] = 'Latitude'
>>> ds['runoff'].attrs['units'] = 'm^3/day'
>>> ds['ID'].attrs['long_name'] = 'Basin ID'
>>>
>>> ds.to_netcdf('ds.nc')
>>>
>>> And here is the ncdump of the file
>>>
>>> netcdf ds {
>>> dimensions:
>>>         ID = 10 ;
>>>         time = 5 ;
>>> variables:
>>>         string lon(ID) ;
>>>                 lon:units = "Degrees East" ;
>>>                 lon:long_name = "Longitude" ;
>>>         string lat(ID) ;
>>>                 lat:units = "Degrees North" ;
>>>                 lat:long_name = "Latitude" ;
>>>         double runoff(time, ID) ;
>>>                 runoff:_FillValue = NaN ;
>>>                 runoff:units = "m^3/day" ;
>>>                 runoff:long_name = "RACMO runoff" ;
>>>         string ID(ID) ;
>>>                 ID:long_name = "Basin ID" ;
>>>         int64 time(time) ;
>>>                 time:units = "days since 1980-01-01 00:00:00" ;
>>>                 time:calendar = "proleptic_gregorian" ;
>>>
>>> // global attributes:
>>>                 :Creator = "Ken Mankoff" ;
>>>                 :Contact = "kdm@xxxxxxx" ;
>>>                 :Institution = "GEUS" ;
>>>                 :Version = 0.1 ;
>>> data:
>>>
>>>  lon = "-27.983", "-27.927", "-27.894", "-28.065", "-28.093", "-28.106",
>>>     "-28.155", "-27.807", "-27.455", "-27.914" ;
>>>
>>>  lat = "83.505", "83.503", "83.501", "83.502", "83.501", "83.499", "83.498",
>>>     "83.485", "83.471", "83.485" ;
>>>
>>>  runoff =
>>>   0.023, 0.01, 0.023, 0.005, 0, 0, 0, 0, 0, 0,
>>>   0.023, 0.01, 0.023, 0.005, 0, 0, 0, 0, 0, 0,
>>>   0.024, 0.013, 0.023, 0.005, 0, 0, 0, 0, 0, 0,
>>>   0.025, 0.012, 0.023, 0.005, 0, 42, 0, 0, 0, 0,
>>>   0.023, 0.005, 0.023, 0.005, 0, 0, 0, 0, 0, 0 ;
>>>
>>>  ID = "1", "2", "5", "8", "9", "10", "12", "13", "15", "16" ;
>>>
>>>  time = 0, 1, 2, 3, 4 ;
>>> }
>>>
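>>> One likely cleanup, hedged (it would change the dump above): pandas
>>> returns the header levels as strings, which is why lon, lat, and ID
>>> come out as string variables. Casting before writing gives numeric
>>> variables instead of "string lon(ID)":
>>>
>>> ds['lon'] = ds['lon'].astype(float)
>>> ds['lat'] = ds['lat'].astype(float)
>>> ds.to_netcdf('ds.nc')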

