NOTICE: This version of the NSF Unidata web site (archive.unidata.ucar.edu) is no longer being updated.
Current content can be found at unidata.ucar.edu.
To learn about what's going on, see About the Archive Site.
You guys are replying faster than I can keep up! (Which is awful nice of you!) I was able to change the chunk size and get a file size that makes much more sense. With a chunk size of 1024, I get a file of 166 kBytes. What are the units of chunk size, by the way?

-Val

> On Apr 5, 2016, at 3:53 PM, Chris Barker <chris.barker@xxxxxxxx> wrote:
>
> oh, and I've enclosed my code -- yours didn't actually run -- missing imports?
>
> On Tue, Apr 5, 2016 at 12:52 PM, Chris Barker <chris.barker@xxxxxxxx> wrote:
>
> On Tue, Apr 5, 2016 at 12:13 PM, Ted Mansell <ted.mansell@xxxxxxxx> wrote:
>
> You might check the ChunkSizes attribute with 'ncdump -hs'. The newer netcdf
> sets larger default chunks than it used to. I had this issue with 1-d
> variables that used an unlimited dimension. Even if the dimension only had a
> small number of entries, the default chunk made the file much bigger.
>
> I had the same issue -- a 1-d variable had a chunksize of 1, which was really,
> really bad!
>
> But that doesn't seem to be the issue here -- I ran the same code, got
> the same results, and here is the dump:
>
> netcdf text3 {
> types:
>   ubyte(*) variable_data_t ;
> dimensions:
>   timestamp_dim = UNLIMITED ; // (1 currently)
>   data_dim = UNLIMITED ; // (1 currently)
>   item_len = 100 ;
> variables:
>   double timestamp(timestamp_dim) ;
>     timestamp:_Storage = "chunked" ;
>     timestamp:_ChunkSizes = 524288 ;
>   variable_data_t data(data_dim) ;
>     data:_Storage = "chunked" ;
>     data:_ChunkSizes = 4194304 ;
>     data:_NoFill = "true" ;
>
> // global attributes:
>   :_Format = "netCDF-4" ;
> }
>
> If I read that right, nice big chunks.
> Note that if I don't use a VLType variable, I still get a 4 MB file -- though
> that could be the netcdf4 overhead:
>
> netcdf text3 {
> types:
>   ubyte(*) variable_data_t ;
> dimensions:
>   timestamp_dim = UNLIMITED ; // (1 currently)
>   data_dim = UNLIMITED ; // (1 currently)
>   item_len = 100 ;
> variables:
>   double timestamp(timestamp_dim) ;
>     timestamp:_Storage = "chunked" ;
>     timestamp:_ChunkSizes = 524288 ;
>   ubyte data(data_dim, item_len) ;
>     data:_Storage = "chunked" ;
>     data:_ChunkSizes = 1, 100 ;
>
> // global attributes:
>   :_Format = "netCDF-4" ;
> }
>
> Something is up with the VLen.....
>
> -CHB
>
> (Assuming the variable is not compressed.)
>
> -- Ted
>
> __________________________________________________________
> | Edward Mansell <ted.mansell@xxxxxxxx>
> | National Severe Storms Laboratory
> |--------------------------------------------------------------
> | "The contents of this message are mine personally and
> | do not reflect any position of the U.S. Government or NOAA."
> |--------------------------------------------------------------
>
> On Apr 5, 2016, at 1:44 PM, Val Schmidt <vschmidt@xxxxxxxxxxxx> wrote:
> >
> > Hello netcdf folks,
> >
> > I'm testing some python code for writing sets of timestamps and variable
> > length binary blobs to a netcdf file, and the resulting file size is
> > perplexing to me.
> >
> > The following segment of python code creates a file with just two
> > variables, "timestamp" and "data", populates the first entry of the
> > timestamp variable with a float and the corresponding first entry of the
> > data variable with an array of 100 unsigned 8-bit integers. The total
> > amount of data is 108 bytes.
> >
> > But the resulting file is over 73 MB in size. Does anyone know why this
> > might be so large and what I might be doing to cause it?
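[Archive note: the chunk sizes in the dump above are enough to account for the observed file size. _ChunkSizes values are counted in array elements per dimension, not bytes, and HDF5 allocates at least one full chunk once a variable is written. A back-of-the-envelope sketch -- the 16 bytes per vlen element is an assumption about the on-disk size of the HDF5 variable-length descriptor, not a documented constant:]

```python
# _ChunkSizes values from the 'ncdump -hs' output above (elements, not bytes)
timestamp_chunk = 524288    # doubles, 8 bytes each
data_chunk = 4194304        # vlen elements; assume ~16 bytes each on disk

total_bytes = timestamp_chunk * 8 + data_chunk * 16
print(total_bytes)          # 71303168
print(total_bytes / 2**20)  # 68.0 MiB -- in the ballpark of the ~73 MB reported
```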
> >
> > Thanks,
> >
> > Val
> >
> >
> > from netCDF4 import Dataset
> > import numpy
> >
> > f = Dataset('scratch/text3.nc','w')
> >
> > dim = f.createDimension('timestamp_dim',None)
> > data_dim = f.createDimension('data_dim',None)
> >
> > data_t = f.createVLType('u1','variable_data_t')
> >
> > timestamp = f.createVariable('timestamp','d','timestamp_dim')
> > data = f.createVariable('data',data_t,'data_dim')
> >
> > timestamp[0] = time.time()
> > data[0] = uint8( numpy.ones(1,100))
> >
> > f.close()
> >
> > ------------------------------------------------------
> > Val Schmidt
> > CCOM/JHC
> > University of New Hampshire
> > Chase Ocean Engineering Lab
> > 24 Colovos Road
> > Durham, NH 03824
> > e: vschmidt [AT] ccom.unh.edu
> > m: 614.286.3726
> >
> > _______________________________________________
> > netcdfgroup mailing list
> > netcdfgroup@xxxxxxxxxxxxxxxx
> > For list information or to unsubscribe, visit:
> > http://www.unidata.ucar.edu/mailing_lists/
>
> --
> Christopher Barker, Ph.D.
> Oceanographer
>
> Emergency Response Division
> NOAA/NOS/OR&R            (206) 526-6959 voice
> 7600 Sand Point Way NE   (206) 526-6329 fax
> Seattle, WA 98115        (206) 526-6317 main reception
>
> Chris.Barker@xxxxxxxx
> <huge_nc_file.py>

------------------------------------------------------
Val Schmidt
CCOM/JHC
University of New Hampshire
Chase Ocean Engineering Lab
24 Colovos Road
Durham, NH 03824
e: vschmidt [AT] ccom.unh.edu
m: 614.286.3726