Rich Signell wrote:
NetCDF-Java folk,

I'm trying to figure out how best to store the Global and US "Surface Summary of Day" data at http://www.ncdc.noaa.gov/oa/climate/climatedata.html#daily in NetCDF format with the CDM Point Feature type conventions: http://www.unidata.ucar.edu/software/netcdf-java/CDM/CFpoints.html

This is daily-averaged surface data (temp, air pressure, etc.) that starts in 1929 with just a few stations and now has thousands of global stations. It's stored on an FTP site with a directory for each year, each containing gzip-compressed text files, one per station. The files in the 2010 directory are replaced every few days with updated files. In their present form the compressed text files take up 2.9 GB, but a single NetCDF file with 22 vars x 81 years x 10,000 stations would be 29 TB without compression.

Looking at the Point Data specs, it seems we could take several approaches:

1. Write with fixed time and station dimensions, fill missing values with NaN, and use NetCDF-4 deflation.
2. Use 5.8.2.2, the ragged array (contiguous) representation.
3. Use 5.8.2.3, the ragged array (non-contiguous) representation.

Since the records in the files are updated regularly, option 2 is probably out, so I'm leaning toward option 3, in which each data variable has just one dimension and all the station data is written into it, with another variable specifying which station each record corresponds to. Does this sound right?

Thanks,
Rich
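[For concreteness, here is a minimal, untested sketch of what option 3 could look like when written with the netcdf-java NetcdfFileWriter API. The dimension sizes, variable names, and the "instance_dimension" attribute spelling (taken from the later CF-1.6 discrete sampling geometry conventions, which may differ from the draft CDM proposal) are illustrative assumptions, not part of the message above; NetCDF-4 output also requires the netCDF-4 C library.]

    import ucar.ma2.DataType;
    import ucar.nc2.Attribute;
    import ucar.nc2.NetcdfFileWriter;
    import ucar.nc2.Variable;

    public class RaggedIndexSketch {
      public static void main(String[] args) throws Exception {
        // NetCDF-4 output so deflation is available.
        NetcdfFileWriter writer =
            NetcdfFileWriter.createNew(NetcdfFileWriter.Version.netcdf4, "sod.nc");

        // One entry per station; obs grows as new records arrive.
        writer.addDimension(null, "station", 10000);
        writer.addUnlimitedDimension("obs");

        // Per-station metadata.
        writer.addVariable(null, "lat", DataType.FLOAT, "station");
        writer.addVariable(null, "lon", DataType.FLOAT, "station");

        // The index variable links each observation back to its station, so a
        // station's records need not be contiguous and can be appended anytime.
        Variable index = writer.addVariable(null, "stationIndex", DataType.INT, "obs");
        writer.addVariableAttribute(index, new Attribute("instance_dimension", "station"));

        Variable time = writer.addVariable(null, "time", DataType.DOUBLE, "obs");
        writer.addVariableAttribute(time, new Attribute("units", "days since 1929-01-01 00:00:00"));

        // One of the ~22 data variables, packed as a short with scale/offset
        // (discussed later in the thread).
        Variable temp = writer.addVariable(null, "temp", DataType.SHORT, "obs");
        writer.addVariableAttribute(temp, new Attribute("scale_factor", 0.1f));
        writer.addVariableAttribute(temp, new Attribute("add_offset", 0.0f));

        writer.create();  // commit the schema; data writes would follow
        writer.close();
      }
    }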
Hi Rich and all:

This is an interesting challenge: getting good read response on such a large dataset.
First, you have to decide what kinds of queries you want to support and what kind of response time is needed. I have generally worked from the assumption that the common queries to optimize are:

1) get data over a time range for all stations in a lat/lon box
2) get data for a single station over a time range, or for all time

Usually I would break the data into multiple files based on time range, aiming for a file size of 50-500 MB. I also use a different format for current vs. archived data, so that the current dataset can be added to dynamically, while the archived data is rewritten (once) for speed of retrieval.
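[As an illustration only: the message above says "multiple files based on time range" without fixing a scheme, so one archive file per year is an assumption here, as are the file names. A time-range query then maps to a list of read-optimized archive files plus one dynamically growing current file:]

    import java.util.ArrayList;
    import java.util.List;

    public class FilesForRange {
      // Map a query time range onto yearly archive files plus the current file.
      static List<String> filesFor(int startYear, int endYear, int currentYear) {
        List<String> files = new ArrayList<>();
        for (int y = startYear; y <= Math.min(endYear, currentYear - 1); y++)
          files.add(String.format("archive/sod_%d.nc", y));  // rewritten once, read-optimized
        if (endYear >= currentYear)
          files.add("current/sod_current.nc");               // appended to as updates arrive
        return files;
      }

      public static void main(String[] args) {
        // prints [archive/sod_2008.nc, archive/sod_2009.nc, current/sod_current.nc]
        System.out.println(filesFor(2008, 2010, 2010));
      }
    }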
Again, it all depends on what queries you want to optimize, so I'll wait for your thoughts on that. Another question is what clients need to access this data. Are you writing your own web service, do you just want remote access from the IDV, or something else? I would think that if we're careful, we can get netcdf-4 sizes similar to the compressed text, but we'll have to experiment. The data appears to be integer or float with a fixed dynamic range, which is amenable to storing as an integer with scale/offset. Integer data compresses much better than floating point because of the noise in the low bits of the mantissa. So one task you should get started on is to examine each field and decide its data type; if floating point, decide on its range and the number of significant bits.
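[To make the scale/offset idea concrete, a rough sketch with an assumed range and precision; the actual per-field numbers are exactly what the message above says needs deciding. A field spanning [-90, 60] at 0.1 precision needs only 1500 distinct values, about 11 significant bits, so it packs comfortably into a 16-bit short:]

    public class PackSketch {
      public static void main(String[] args) {
        double min = -90.0, max = 60.0;  // assumed field range (e.g. temperature in degC)
        double precision = 0.1;          // smallest difference worth preserving

        // (max - min) / precision = 1500 steps, ~11 bits, so a short is plenty.
        // Standard CF packing: packed = round((value - add_offset) / scale_factor)
        double scaleFactor = precision;
        double addOffset = min;

        double value = 23.7;
        short packed = (short) Math.round((value - addOffset) / scaleFactor);
        double unpacked = packed * scaleFactor + addOffset;

        System.out.printf("packed=%d, unpacked=%.1f%n", packed, unpacked);  // 1137, 23.7
      }
    }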