Hello,

I am new to this mailing list, and a scientist using netCDF rather than a programmer who could really contribute to netCDF development. Since I am working with global atmospheric chemistry models which produce their output in netCDF format, I am unfortunately all too aware of the library's missing compression capabilities. Browsing the mailing list archive, I found that someone had actually implemented some type of compression in 1998, and that this had apparently reached a stage where it was about to be included in the "official" netCDF package. Now I would like to know whether there are any active efforts to introduce packing into netCDF (and if so, when to expect them). I would be happy to serve as a beta tester, using files between 10 kB and 1 GB uncompressed. In case no such activity is currently under way, I would like to contribute a few thoughts on this issue regarding backward compatibility as well as efficiency and usage; please find them attached below.

I hope that this mail is not inappropriate for the netCDF group.

With kindest regards,
Martin Schultz

------------------------
compression ideas:

(1) From what I have heard about bzip2 compression, it should probably be the algorithm of choice, especially since it is patent free and distributed under a liberal open-source license.

(2) A primitive packing scheme, which would even maintain compatibility with older netCDF versions, could act on individual variables and use variable attributes to indicate whether a variable is compressed. A variable should only be compressed if it exceeds a certain size (e.g. 100 kB). The method would replace the variable's existing dimension description with a single dimension giving the number of bytes the compressed data take up, and the variable type would be changed to BYTE. The new variable attributes would be:

    _compressed_ = 1                  ! logical to indicate that this variable is compressed
    _old_dims_   = integer array(10)  ! holds the original dimension indices
    _old_type_                        ! holds the original variable type

Advantages:
* every old program can still parse these files without modification.
* since the compressed variable is stored as a byte vector in one piece, it can be saved 1:1 to a file and the data retrieved with the bzip2 command; one gets a raw binary file which is easily read with almost any software.
* since relatively large data blocks will be compressed, compression should be effective in many real-world applications.

Disadvantages:
* storing or retrieving parts of the variable requires decompressing the complete variable data; adding extra data along the unlimited dimension requires decompressing the old variable, appending the new part, and recompressing, along with a dimension change.
* (thus) extremely inefficient for data sets with few huge variables, and in multi-threaded environments, because output can only be done on one thread.

(A rough code sketch of this scheme follows below.)
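To make idea (2) concrete, here is a minimal C sketch, assuming the classic netCDF-3 C interface and libbz2. The function put_compressed_var, the "<name>_nbytes" dimension convention, and the error handling are my own illustrative assumptions, not an existing API:

    /* Idea (2): store one float variable bzip2-compressed as an NC_BYTE
     * vector, with attributes recording its original shape and type.
     * Assumes ncid is open and in define mode; error checks abbreviated. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <netcdf.h>
    #include <bzlib.h>

    static int put_compressed_var(int ncid, const char *name,
                                  const float *data, size_t nvals,
                                  const int *old_dims, int old_ndims)
    {
        unsigned int srclen  = (unsigned int)(nvals * sizeof(float));
        unsigned int destlen = srclen + srclen / 100 + 600; /* bzip2 bound */
        char dimname[NC_MAX_NAME + 1];
        char *buf = malloc(destlen);
        int dimid, varid, one = 1, old_type = NC_FLOAT;

        if (buf == NULL)
            return NC_ENOMEM;
        /* destlen comes back as the actual compressed size. */
        if (BZ2_bzBuffToBuffCompress(buf, &destlen, (char *)data, srclen,
                                     9, 0, 0) != BZ_OK) {
            free(buf);
            return NC_EINVAL;
        }

        /* Replace the variable's shape with one byte-count dimension. */
        snprintf(dimname, sizeof dimname, "%s_nbytes", name);
        nc_def_dim(ncid, dimname, destlen, &dimid);
        nc_def_var(ncid, name, NC_BYTE, 1, &dimid, &varid);

        /* The attributes proposed above. */
        nc_put_att_int(ncid, varid, "_compressed_", NC_INT, 1, &one);
        nc_put_att_int(ncid, varid, "_old_dims_", NC_INT,
                       (size_t)old_ndims, old_dims);
        nc_put_att_int(ncid, varid, "_old_type_", NC_INT, 1, &old_type);

        nc_enddef(ncid);
        nc_put_var_schar(ncid, varid, (signed char *)buf);
        free(buf);
        return NC_NOERR;
    }

An old reader then simply sees a one-dimensional BYTE variable with three unknown attributes, which is why such files stay parseable everywhere.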
(3) True support for compressed data would require changes that make files unreadable by older netCDF versions. However, one should at least try to preserve the header format, so that existing programs can still analyze what is in a file (I am saying this naively, without having looked at the netCDF source code). Perhaps one could introduce an extra layer that manages the compression details. Compression could still be done on a per-variable basis, and only for variables exceeding a threshold size; existing software could then perhaps still read the uncompressed parts of the file. Variable compression should be broken down along the unlimited dimension, so that each "record" (time step etc.) is compressed individually. Information about the individual block sizes would have to be stored in the extra layer (see the sketch after this list).

Advantages:
* may maintain at least some backward compatibility
* adding extra records is easy and relatively fast
* retrieving individual records along the unlimited dimension is easy and fast

Disadvantages:
* writing or reading subranges of the fixed dimensions is cumbersome
* because the entry point for compression is fixed (large variables along the unlimited dimension), one probably does not achieve the maximum possible compression ratio
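As a minimal sketch of the bookkeeping the "extra layer" of idea (3) would need, assuming one compressed block per record; none of these types exist in the netCDF library:

    #include <stddef.h>

    /* Hypothetical per-variable table for idea (3): each record along the
     * unlimited dimension is compressed separately, so locating record i
     * needs only its table entry, not decompression of anything else. */
    struct compressed_record {
        size_t offset;   /* byte offset of the record's compressed block */
        size_t nbytes;   /* compressed size of that block */
    };

    struct compressed_var {
        int    varid;                    /* variable this table describes */
        size_t nrecords;                 /* records written so far */
        struct compressed_record *table; /* one entry per record; grows
                                            with the unlimited dimension */
    };

Appending a record adds one compressed block and one table entry, which is why writing new records stays cheap; reading a subrange of the fixed dimensions, however, still forces full decompression of every record it touches.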