Hi Thanh,

> I want to extract a particular variable from many NetCDF files
> automatically (the files are identical in structure but cover different
> times, for example 00, 06, 12, 18, and 24 hours). For example, I want
> to get a temperature variable from about 1000 NetCDF files. Can you
> tell me how to extract the temperature variable in the shortest time?

It depends on the format of the 1000 input files, the complexity of each
file (in terms of the number of variables and attributes), and whether
this is something you want to do just once or the same 1000 files will be
accessed many times for similar tasks.

If the files are not particularly complex and you just want to extract
all the temperature data once, I think the fastest way is to loop through
the files, opening them one at a time and reading the desired data. The
time this takes may be dominated by the time to open each file, which for
either netCDF-3 or netCDF-4 involves reading all the metadata into memory
when the file is opened. If there are lots of variables and attributes
that have nothing to do with the temperature data you want, this may take
a while. You can either write a program to do this or use one of the
packages designed for such tasks, such as NCO, NCL, or CDO. For
descriptions and links, see the list of software for manipulating or
displaying netCDF data:

  http://www.unidata.ucar.edu/netcdf/software.html

An alternative may be advisable if there are a lot of other variables
with a lot of attributes in each file and you want to support similar
data queries efficiently for future users. In that case you might want to
first convert netCDF classic or 64-bit offset files into netCDF-4 classic
model files, something that can be done easily with the nccopy utility.
This will take a lot of time once, but after that you can take advantage
of a feature of HDF5 access. Reading the desired data with HDF5 may not
be any faster, but opening each file will be significantly faster,
because HDF5 reads metadata only when it's needed, so it won't spend any
time building a schema for all the variables and their attributes on
open. However, you would actually have to use the HDF5 library to get
this efficiency, as the netCDF-4 library still reads all of a file's
metadata on open.

A third alternative, if the archive of files will be accessed a great
many times with queries you can anticipate will each need to open lots of
files, is to reorganize the data to match the pattern of anticipated
queries. For example, if the data is stored spatially, but will be
accessed as time series at a point, you may want to provide files
organized with the time axis varying most rapidly. This can also be
accomplished with the nccopy utility in netCDF-4.2. Other approaches
include a recent innovative use of Hilbert space-filling curves and the
Hadoop file system by Tanu Malik and colleagues:

  http://hpdgis.cigi.uiuc.edu/node/14
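
To make these alternatives concrete, here's a rough, untested sketch of
the first one (looping and extracting), assuming Python with the netCDF4
module; the variable name "temperature", the file glob, and the
concatenation axis are placeholders you'd adapt to your files:

    import glob
    import numpy as np
    from netCDF4 import Dataset

    arrays = []
    for path in sorted(glob.glob("data/*.nc")):
        # Opening a netCDF-3 or netCDF-4 file reads all of its metadata,
        # which is where most of the time may go for complex files.
        with Dataset(path, "r") as nc:
            arrays.append(nc.variables["temperature"][:])

    # Stack the per-file slices along the time axis.
    temperature = np.concatenate(arrays, axis=0)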
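
For the second alternative, here's a similar sketch of the one-time
conversion with nccopy followed by fast opens through the HDF5 library
(the h5py Python module here). I believe "-k 4" selects the netCDF-4
classic model format, but check nccopy's usage message for the format
codes in your version:

    import subprocess
    import h5py

    # One-time conversion: classic or 64-bit offset file to netCDF-4
    # classic model ("-k 4"), which is stored in HDF5 format.
    subprocess.run(["nccopy", "-k", "4", "in.nc", "in_nc4.nc"], check=True)

    # HDF5 reads metadata lazily, so the open is cheap even when the file
    # holds many unrelated variables and attributes.
    with h5py.File("in_nc4.nc", "r") as f:
        temperature = f["temperature"][:]  # touches only this variable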
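
And for the third alternative, a sketch of rechunking with nccopy's "-c"
option so that a time series at one point is stored together on disk; the
dimension names and chunk lengths here are made up, so substitute the
ones from your files:

    import subprocess

    # Long, thin chunks along time: a whole series at one grid point
    # lands in a few chunks instead of one value per spatial slab.
    subprocess.run(
        ["nccopy", "-k", "4", "-c", "time/1000,lat/1,lon/1",
         "spatial.nc", "timeseries.nc"],
        check=True,
    )

--Russ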