Rich Signell wrote:
NetCDF-Java folk,

I'm trying to figure out how best to store the Global and US "Surface Summary of Day" data at http://www.ncdc.noaa.gov/oa/climate/climatedata.html#daily in NetCDF format with the CDM Point Feature type conventions: http://www.unidata.ucar.edu/software/netcdf-java/CDM/CFpoints.html

This is daily-averaged surface data (temp, air pressure, etc.) that starts in 1929 with just a few stations and now has thousands of global stations. It's stored on an FTP site with a directory for each year, each containing gzip-compressed text files, one per station. The files in the 2010 directory are replaced every few days with updated files. In their present form the compressed text files take up 2.9 GB, but a single NetCDF file with 22 vars x 81 years x 10,000 stations would be 29 TB without compression.

Looking at the Point Data specs, it seems we could take several approaches:

1. Write with fixed time and station dimensions, fill missing values with NaN, and use NetCDF-4 deflation.
2. Use 5.8.2.2, the ragged array (contiguous) representation.
3. Use 5.8.2.3, the ragged array (non-contiguous) representation.

Since the records in the files are updated regularly, option 2 is probably out, so I'm leaning toward option 3, in which each data variable has just one dimension and all the station data is written into it, with another variable specifying which station each record corresponds to. Does this sound right?

Thanks,
Rich
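[For concreteness, here is a minimal, untested sketch of what option 3 could look like when written with the netcdf-java NetcdfFileWriter API. The dimension sizes, variable names, and the "instance_dimension" attribute spelling (taken from the later CF-1.6 discrete sampling geometry conventions, which may differ from the draft CDM proposal) are illustrative assumptions, not part of the message above; NetCDF-4 output also requires the netCDF-4 C library.]

    import ucar.ma2.DataType;
    import ucar.nc2.Attribute;
    import ucar.nc2.NetcdfFileWriter;
    import ucar.nc2.Variable;

    public class RaggedIndexSketch {
      public static void main(String[] args) throws Exception {
        // NetCDF-4 output so deflation is available.
        NetcdfFileWriter writer =
            NetcdfFileWriter.createNew(NetcdfFileWriter.Version.netcdf4, "sod.nc");

        // One entry per station; obs grows as new records arrive.
        writer.addDimension(null, "station", 10000);
        writer.addUnlimitedDimension("obs");

        // Per-station metadata.
        writer.addVariable(null, "lat", DataType.FLOAT, "station");
        writer.addVariable(null, "lon", DataType.FLOAT, "station");

        // The index variable links each observation back to its station, so a
        // station's records need not be contiguous and can be appended anytime.
        Variable index = writer.addVariable(null, "stationIndex", DataType.INT, "obs");
        writer.addVariableAttribute(index, new Attribute("instance_dimension", "station"));

        Variable time = writer.addVariable(null, "time", DataType.DOUBLE, "obs");
        writer.addVariableAttribute(time, new Attribute("units", "days since 1929-01-01 00:00:00"));

        // One of the ~22 data variables, packed as a short with scale/offset
        // (discussed later in the thread).
        Variable temp = writer.addVariable(null, "temp", DataType.SHORT, "obs");
        writer.addVariableAttribute(temp, new Attribute("scale_factor", 0.1f));
        writer.addVariableAttribute(temp, new Attribute("add_offset", 0.0f));

        writer.create();  // commit the schema; data writes would follow
        writer.close();
      }
    }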
Hi Rich and all:

This is an interesting challenge: getting good read response on such a large dataset.
First, you have to decide what kinds of queries you want to support and what kind of response time is needed. I have generally worked from the assumption that the common queries to optimize are:

1) get data over a time range for all stations in a lat/lon box
2) get data for a single station over a time range, or for all time

Usually I would break the data into multiple files based on time range, aiming for a file size of 50-500 MB. I also use a different format for current vs. archived data, so that the current dataset can be added to dynamically, while the archived data is rewritten (once) for speed of retrieval.
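[As an illustration only: the message above says "multiple files based on time range" without fixing a scheme, so one archive file per year is an assumption here, as are the file names. A time-range query then maps to a list of read-optimized archive files plus one dynamically growing current file:]

    import java.util.ArrayList;
    import java.util.List;

    public class FilesForRange {
      // Map a query time range onto yearly archive files plus the current file.
      static List<String> filesFor(int startYear, int endYear, int currentYear) {
        List<String> files = new ArrayList<>();
        for (int y = startYear; y <= Math.min(endYear, currentYear - 1); y++)
          files.add(String.format("archive/sod_%d.nc", y));  // rewritten once, read-optimized
        if (endYear >= currentYear)
          files.add("current/sod_current.nc");               // appended to as updates arrive
        return files;
      }

      public static void main(String[] args) {
        // prints [archive/sod_2008.nc, archive/sod_2009.nc, current/sod_current.nc]
        System.out.println(filesFor(2008, 2010, 2010));
      }
    }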
Again, it all depends on what queries you want to optimize, so I'll wait for your thoughts on that. Another question is what clients need to access this data. Are you writing your own web service, do you just want remote access from the IDV, or something else? I would think that if we're careful, we can get netcdf-4 sizes similar to the compressed text, but we'll have to experiment. The data appears to be integer or float with a fixed dynamic range, which is amenable to storing as an integer with scale/offset. Integer data compresses much better than floating point because of the noise in the low bits of the mantissa. So one task you should get started on is to examine each field and decide its data type; if floating point, decide on its range and the number of significant bits.
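[To make the scale/offset idea concrete, a rough sketch with an assumed range and precision; the actual per-field numbers are exactly what the message above says needs deciding. A field spanning [-90, 60] at 0.1 precision needs only 1500 distinct values, about 11 significant bits, so it packs comfortably into a 16-bit short:]

    public class PackSketch {
      public static void main(String[] args) {
        double min = -90.0, max = 60.0;  // assumed field range (e.g. temperature in degC)
        double precision = 0.1;          // smallest difference worth preserving

        // (max - min) / precision = 1500 steps, ~11 bits, so a short is plenty.
        // Standard CF packing: packed = round((value - add_offset) / scale_factor)
        double scaleFactor = precision;
        double addOffset = min;

        double value = 23.7;
        short packed = (short) Math.round((value - addOffset) / scaleFactor);
        double unpacked = packed * scaleFactor + addOffset;

        System.out.printf("packed=%d, unpacked=%.1f%n", packed, unpacked);  // 1137, 23.7
      }
    }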