Re: [netcdf-java] Errors reading certain NetCDF4 data

  • To: Ryan May <rmay@xxxxxxxx>
  • Subject: Re: [netcdf-java] Errors reading certain NetCDF4 data
  • From: Christian Ward-Garrison <cwardgar@xxxxxxxx>
  • Date: Thu, 19 Feb 2015 15:34:17 -0700
Hi Antonio,

Our point about data layout stands, but if you still want to see what
performance benefits you can get by rechunking, I think you should use a
different shape. In his follow-up blog [1], Russ Rew provides a Python
function that calculates a good 3D chunk shape for your read pattern:

print(chunk_shape_3D([5088, 103, 122], 4, 4096))
[21, 6, 8]

Maybe give that a try instead?
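To see why a shape like [21, 6, 8] is considered balanced, you can count how many chunks each access pattern touches. This is a rough sketch, not code from the thread or from the blog post; `reads_for_access` is a hypothetical helper:

```python
import math

def reads_for_access(var_shape, chunk_shape):
    """Chunks touched by (a) a full time series at one grid point
    and (b) one full horizontal (lat, lon) slice, for a 3D variable
    dimensioned (time, lat, lon)."""
    nt, ny, nx = var_shape
    ct, cy, cx = chunk_shape
    series_reads = math.ceil(nt / ct)                      # chunks stacked along time
    slice_reads = math.ceil(ny / cy) * math.ceil(nx / cx)  # chunk tiles covering one slice
    return series_reads, slice_reads

# Balanced shape from chunk_shape_3D: both patterns cost a few hundred reads.
print(reads_for_access((5088, 103, 122), (21, 6, 8)))    # (243, 288)

# The 512x4x4 chunking favors time series but makes horizontal slices expensive.
print(reads_for_access((5088, 103, 122), (512, 4, 4)))   # (10, 806)
```

The point of the balanced shape is that no single access pattern pays a catastrophic cost, at the price of not being optimal for any one of them.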

Cheers,
Christian

[1]
http://www.unidata.ucar.edu/blogs/developer/en/entry/chunking_data_choosing_shapes

On Thu, Feb 19, 2015 at 12:45 PM, Ryan May <rmay@xxxxxxxx> wrote:

> Antonio,
>
> Sorry, I misspoke--time *should* be the last dimension, since for
> C-ordering, the last dimension will vary the fastest (i.e. items along this
> dimension will be sequential in memory). (I then got that crossed-up with
> your chunking description, which you're correct about.)
>
> It's possible for chunking to make up some of the performance difference,
> but you're never going to be as fast as just re-ordering the data. Russ
> Rew's example quoted times with chunking going from 200 seconds to 1.4
> seconds; his example had about 20x as many time steps as yours.
> Given that you're quoting times of less than 1 second, I wonder if you're
> just not dominated by the seek time. Certainly, since you're on an SSD, the
> penalties for non-sequential access are much less than for disks.
>
> Ryan
>
> On Thu, Feb 19, 2015 at 11:48 AM, Antonio Rodriges <antonio.rrz@xxxxxxxxx>
> wrote:
>
>> Ryan,
>>
>> I do have time as my first dimension (Christian suggested making time
>> the last dimension)
>> and thought that after rechunking I would get something like this:
>>
>> 4x4 (lat and lon 2D array located contiguously on disk), 4x4, 4x4,
>> 4x4, ......, 4x4
>> <<---------------------------- the number of rasters is 512
>> ---------------------------->>
>> so the distance between the different dates is not 8 kb but should be
>> only 4 x 4 x sizeof(float) = 64 bytes for the expected layout
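Antonio's 64-byte figure checks out for a C-ordered (time, lat, lon) chunk. A quick sketch, not part of the thread; `time_stride_bytes` is a hypothetical helper:

```python
def time_stride_bytes(chunk_shape, val_size=4):
    """Byte distance between values at consecutive time steps for the
    same (lat, lon) point inside one C-ordered (t, y, x) chunk: each
    time step advances past one full lat-lon tile."""
    ct, cy, cx = chunk_shape
    return cy * cx * val_size

print(time_stride_bytes((512, 4, 4)))  # 4 * 4 * sizeof(float) = 64
```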
>>
>> Here is the metadata (it does not show the chunk sizes; is it possible
>> to view them?):
>>
>> netcdf
>> file:/d:/RS_DATA/worker/merra_ts/tavg1_2d_slv_Nx/wind_australia_chunked/u10m/chunked/
>> 2014_ch.nc
>> {
>>  dimensions:
>>    latitude = 103;
>>    longitude = 122;
>>    time = UNLIMITED;   // (5088 currently)
>>  variables:
>>    double latitude(latitude=103);
>>      :_Netcdf4Dimid = 0; // int
>>      :units = "degrees_north";
>>      :long_name = "Latitude";
>>    double longitude(longitude=122);
>>      :_Netcdf4Dimid = 1; // int
>>      :units = "degrees_east";
>>      :long_name = "Longitude";
>>    double time(time=5088);
>>      :_Netcdf4Dimid = 2; // int
>>      :units = "hours since 2014-1-1 0";
>>    float u10m(time=5088, latitude=103, longitude=122);
>>      :comments = "Unknown1 variable comment";
>>      :long_name = "Eastward wind at 10 m above displacement height";
>>      :units = "m s-1";
>>      :grid_name = "grid-1";
>>      :grid_type = "linear";
>>      :level_description = "Earth surface";
>>      :time_statistic = "instantaneous";
>>      :missing_value = 9.9999999E14f; // float
>>
>>  :Conventions = "COARDS";
>>  :calendar = "standard";
>>  :comments = "file created by grads using lats4d available from
>> http://dao.gsfc.nasa.gov/software/grads/lats4d/";
>>  :model = "geos/das";
>>  :center = "gsfc";
>>  :history = "Mon Dec 01 20:20:48 2014:
>>
>> D:\\DATA\\worker\\merra_ts\\tavg1_2d_slv_Nx\\wind_australia\\u10m\\ncks.exe
>> -4 --cnk_dmn lat,4 --cnk_dmn lon,4 --cnk_dmn time,512 2014.nc
>> 2014_ch.nc\\nWed Oct 15 20:26:23 2014: ncrcat -v u10m -o 2014.nc";
>>  :nco_openmp_thread_number = 1; // int
>>  :nco_input_file_number = 212; // int
>>  :NCO = "20141201";
>> }
>>
>> 2015-02-19 21:24 GMT+03:00 Ryan May <rmay@xxxxxxxx>:
>> > Antonio,
>> >
>> > Even with that chunk size, the number of bytes between consecutive
>> points in
>> > time is 512 x 4 x sizeof(float), which is 8 kb. You may get a few points
>> > closer together, but they're still not close together. Any read ahead
>> > function of the disk will be throwing away 99% of the data if all you
>> want
>> > is all the time for a single point.
>> >
>> > If your predominant access pattern is all times for a single point,
>> your
>> > best throughput will be achieved by making sure that those points are
>> > consecutive on disk, which means that you should have time be the first
>> > dimension, not the last. Anything else you do will be papering over the
>> core
>> > problem.
>> >
>> > Ryan
>> >
>> > On Thu, Feb 19, 2015 at 10:37 AM, Antonio Rodriges <
>> antonio.rrz@xxxxxxxxx>
>> > wrote:
>> >>
>> >> Christian,
>> >>
>> >> According to Russ Rew
>> >>
>> >>
>> http://www.unidata.ucar.edu/blogs/developer/entry/chunking_data_why_it_matters
>> >> chunking should help for my access pattern
>> >>
>> >> After rechunking I expected to get chunks of size 512x4x4, in which
>> >> values for a single point at different times would be stored very
>> >> close together on disk
>> >
>> >
>> >
>> >
>> > --
>> > Ryan May
>> > Software Engineer
>> > UCAR/Unidata
>> > Boulder, CO
>>
>
>
>
> --
> Ryan May
> Software Engineer
> UCAR/Unidata
> Boulder, CO
>