Thanks John and Joe. Yes, I do know that disk I/O is the limiting factor, but optimising it isn't easy because of all the buffers and disk caches (as you and Joe have pointed out).

Interestingly, I can "see" these caches. When I read random chunks of data from a file, a read sometimes takes ~1 ms, sometimes ~5 ms and sometimes ~10 ms, with not much in between these values (a trimodal distribution). I think these must be three levels of caching. Also, if I run the same test multiple times on the same file, the number of 10 ms reads drops off and the number of 1 ms reads increases. (I'm on a Windows XP laptop with a 5400 rpm hard drive.) I guess the only way to bypass the caches would be to cycle through a large set of data files whose total size exceeds the disk caches. (I'm trying to simulate a busy server environment.)

By the way, I've been digging into the IOSPs and the ucar RandomAccessFile class. The ucar RAF seems to be the same as java.io.RandomAccessFile, except that it implements an 8 KB buffer which is supposed to increase performance. But the code of N3raf (which extends N3iosp, and which I assume is the default class used for data reading) uses raf.readToByteChannel(), which bypasses the 8 KB buffer. So could a java.io.RandomAccessFile have been used in this case?

To expand a little on my use case: in general, to create a low-resolution map of data for a WMS, one has to read only a small fraction of the available data in the file. So I'm looking for an efficient way to read sparse, unevenly-spaced clouds of data. Reading point-by-point is not efficient, but neither is reading lots of data, converting it to new types, then throwing most of it away.

Cheers, Jon

From: John Caron [mailto:caron@xxxxxxxxxxxxxxxx]
Sent: 15 July 2010 20:00
To: Joe Sirott
Cc: Jon Blower; netcdf-java@xxxxxxxxxxxxxxxx
Subject: Re: [netcdf-java] Reading contiguous data in NetCDF files

Thanks Joe, I agree with your analysis. It's very hard to time I/O accurately, because there are disk and OS caches etc.
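Jon's trimodal latency measurement can be reproduced with a sketch along these lines. This is an editorial illustration, not the benchmark he actually ran: the `measure` helper and the scratch-file setup are invented for the example, and the absolute timings (and how many modes appear) depend entirely on the OS and hardware.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.Random;

public class ReadLatency {
    /** Time n random 8 KB reads from f; returns latencies in microseconds. */
    static long[] measure(File f, int n) throws IOException {
        byte[] buf = new byte[8192];
        long[] micros = new long[n];
        Random rnd = new Random(42);
        try (RandomAccessFile raf = new RandomAccessFile(f, "r")) {
            long span = raf.length() - buf.length;
            for (int i = 0; i < n; i++) {
                long off = (long) (rnd.nextDouble() * span);
                long t0 = System.nanoTime();
                raf.seek(off);
                raf.readFully(buf);
                micros[i] = (System.nanoTime() - t0) / 1000;
            }
        }
        return micros;
    }

    public static void main(String[] args) throws IOException {
        // Scratch file so the sketch is self-contained; in Jon's test this
        // would be a set of real data files larger than the caches.
        File f = File.createTempFile("latency", ".dat");
        f.deleteOnExit();
        try (RandomAccessFile raf = new RandomAccessFile(f, "rw")) {
            raf.setLength(16 * 1024 * 1024); // 16 MB of zeros
        }
        // Plotting a histogram of these values is what exposes the
        // multi-modal distribution Jon describes.
        for (long us : measure(f, 20)) {
            System.out.println(us + " us");
        }
    }
}
```

Running it twice in a row on the same file should also show the second run skewing toward the fast mode, as the OS page cache warms up.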
Netcdf-Java also caches small variable data. The netcdf-4 format is an order of magnitude more complicated, with chunking, compression, and non-deterministic (perhaps order-dependent is a better term) data placement. The most useful optimisation is to try to make the commonly wanted subset fit inside a single chunk (or a small number of chunks). Jon, have you profiled your code, and are you sure that disk reading is the bottleneck?

On 7/15/2010 11:39 AM, Joe Sirott wrote:

Hi Jon,

Benchmarks like these can be quite tricky, because of the interaction between the application and the OS. Unless you purge the OS page cache each time you run your benchmark, your application (after the first test) isn't reading data from disk but is instead copying data from the disk page cache into local buffers; the benchmark will then likely be CPU-bound, and execution time will be dominated by the type conversion from raw buffered data arrays into Java types. That would account for the strange results you are seeing when reading 4 KB rather than 8 KB data chunks. Also, for more information on netcdf-4 chunking/compression, Unidata has a nice introduction at http://hdfeos.org/workshops/ws13/presentations/day1/HDF5-EOSXIII-Advanced-Chunking.ppt

Cheers, Joe

Jon Blower wrote:

Hi John, thanks for this.

> The netcdf-3 IOSP uses a buffered RandomAccessFile implementation with a default 8096-byte buffer, which always reads 8096 bytes at a time. The only useful optimisation is to change the buffer size.

Good to know, thanks. I would have thought that this would mean there's no point in reading less than 8096 bytes of data. But in my tests I see that even below this value there's a linear relationship between the size of the data being read and the time taken to read it (i.e. it's quicker to read 4 KB than 8 KB). I don't quite understand this.

Are there any specs for the NetCDF-4 format that I could read? I'd like to know more about how the data are compressed, and how much data actually needs to be read from disk to get a subset.
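John's advice to "make the commonly wanted subset fit inside a small number of chunks" can be made concrete with a small sketch. This is an editorial illustration, not netcdf-java code: `chunksTouched` is an invented helper that counts how many HDF5-style chunks a 2-D subset intersects. Every intersected chunk must be read (and, if compressed, decompressed) in full, so fewer intersected chunks means less I/O per subset.

```java
// Editor's sketch (not part of netcdf-java): count how many chunks a
// 2-D subset [j0..j1] x [i0..i1] intersects, given the chunk shape
// (chunkJ x chunkI). Indices are inclusive, 0-based.
public class ChunkCount {
    static long chunksTouched(int j0, int j1, int i0, int i1,
                              int chunkJ, int chunkI) {
        long nj = (j1 / chunkJ) - (j0 / chunkJ) + 1; // chunk rows touched
        long ni = (i1 / chunkI) - (i0 / chunkI) + 1; // chunk cols touched
        return nj * ni;
    }

    public static void main(String[] args) {
        // A 100x100 subset straddling the chunk grid at (50, 50)
        // touches 4 chunks of shape 100x100...
        System.out.println(chunksTouched(50, 149, 50, 149, 100, 100)); // 4
        // ...but the same-sized subset aligned to the grid touches only 1.
        System.out.println(chunksTouched(0, 99, 0, 99, 100, 100));     // 1
    }
}
```

This is why choosing the chunk shape to match the typical access pattern (here, a map-sized tile) matters more than any read-side tuning.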
Cheers, Jon

-----Original Message-----
From: netcdf-java-bounces@xxxxxxxxxxxxxxxx [mailto:netcdf-java-bounces@xxxxxxxxxxxxxxxx] On Behalf Of John Caron
Sent: 15 July 2010 00:26
To: netcdf-java@xxxxxxxxxxxxxxxx
Subject: Re: [netcdf-java] Reading contiguous data in NetCDF files

Hi Jon:

On 7/14/2010 2:51 PM, Jon Blower wrote:

> Hi, I don't know anything about how data in NetCDF files are organized, but intuitively I would think that, for a general 2D array, the data at points [j,i] and [j,i+1] would be contiguous on disk. Is this right? (i is the fastest-varying dimension.)

Yes, for variables in netcdf-3 files.

> I might also suppose that, for an array of size [nj,ni], the data at points [j,ni-1] and [j+1,0] would also be contiguous. Is this true?

Yes, for variables in netcdf-3 files that don't use the unlimited dimension.

> If so, is there a method in Java-NetCDF that would allow me to read these two points (and only these two points) in a single operation?

The netcdf-3 IOSP uses a buffered RandomAccessFile implementation with a default 8096-byte buffer, which always reads 8096 bytes at a time. The only useful optimisation is to change the buffer size.

> (Background: I'm trying to improve the performance of ncWMS by optimising how data is read from disk. This seems to involve striking a balance between the number of individual read operations and the size of each read operation.)
>
> Thanks, Jon

--
Dr Jon Blower
Technical Director, Reading e-Science Centre
Environmental Systems Science Centre
University of Reading
Harry Pitt Building, 3 Earley Gate
Reading RG6 6AL
UK
Tel: +44 (0)118 378 5213
Fax: +44 (0)118 378 6413
j.d.blower@xxxxxxxxxxxxx
http://www.nerc-essc.ac.uk/People/Staff/Blower_J.htm

_______________________________________________
netcdf-java mailing list
netcdf-java@xxxxxxxxxxxxxxxx
For list information or to unsubscribe, visit: http://www.unidata.ucar.edu/mailing_lists/
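The contiguity questions Jon asks in the thread above follow from row-major flat indexing, which a short sketch can demonstrate. This is an editorial footnote with an invented `offset` helper, not netcdf-java code: for a row-major [nj, ni] array stored contiguously (as netcdf-3 stores variables without an unlimited dimension), element [j, i] sits at flat offset j*ni + i, so [j, i] and [j, i+1] are adjacent, and so are [j, ni-1] and [j+1, 0] at the row boundary.

```java
// Editor's sketch: row-major flat offset of element [j, i] in an
// [nj, ni] array. Multiply by the element size in bytes to get the
// byte position within the variable's data block.
public class RowMajor {
    static long offset(long j, long i, long ni) {
        return j * ni + i;
    }

    public static void main(String[] args) {
        long ni = 360; // hypothetical longitude dimension
        // Neighbours within a row are adjacent...
        System.out.println(offset(5, 100, ni) + 1 == offset(5, 101, ni));
        // ...and the end of one row is adjacent to the start of the next,
        // which is why [j, ni-1] and [j+1, 0] can be fetched in one read.
        System.out.println(offset(5, ni - 1, ni) + 1 == offset(6, 0, ni));
    }
}
```

This also shows why the buffered reads John describes help: one 8096-byte read starting at [j, ni-1] naturally covers the following elements of row j+1 as well.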