Re: [netcdfgroup] NetCDF NC_CHAR file double the size of ASCII file

To: Dave Allured - NOAA Affiliate <dave.allured@xxxxxxxx>
Subject: Re: [netcdfgroup] NetCDF NC_CHAR file double the size of ASCII file
From: Timothy Stitt <Timothy.Stitt.9@xxxxxx>
Date: Tue, 20 May 2014 14:55:43 -0400

Thanks Dave and Chris etc.

Chunking with 1024 instead of 1 on the leading unlimited dimension cut one of 
my big test files down from 444MB to 380MB. Trying values on either side of 
1024 didn’t give any further improvement. Using deflate level 2, though, my 
file size dropped dramatically to 91MB. Higher levels only knocked off at most 
10MB more. 
So your advice and suggestions definitely paid off. Thanks for educating me on 
the usage of NETCDF-4. I'm definitely understanding the various formats and 
options much better now.

Dave – thanks for the advice. I’ll go back and check the file creation routine 
to see if something is slightly skewed with the data prep and writes. In a way 
I hope there is, since I should get even better size reduction than what I am 
currently getting.

One last question if that is ok. When using chunking, do parallel reads read in 
data in sizes proportional to the chunk size? If so, I am assuming then that 
each of my MPI workers will read in multiple records (that compose a chunk) at 
one time?

Thanks again,

Tim.
______________________________________________
Tim Stitt PhD
User Support Manager (CRC)
Research Assistant Professor (Computer Science & Engineering)
Room 108, Center for Research Computing, University of Notre Dame, IN 46556
Email: tstitt@xxxxxx

From: Dave Allured - NOAA Affiliate <dave.allured@xxxxxxxx>
Date: Tuesday, May 20, 2014 at 2:32 PM
To: CRC <timothy.stitt.9@xxxxxx>
Cc: "netcdfgroup@xxxxxxxxxxxxxxxx" <netcdfgroup@xxxxxxxxxxxxxxxx>
Subject: Re: [netcdfgroup] NetCDF NC_CHAR file double the size of ASCII file

Tim,

You just said that your original ASCII file, plate.10000, is 2.2 Mbytes and 
contains 40000 lines.  This works out to an AVERAGE line length of 55 
characters.  However, in the sample ncdump you are allowing 87 characters per 
line.  This is a large increase in the amount of allocated space, actually 58% 
more.  This probably explains the majority of the difference between your ASCII 
and Netcdf versions.
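
The arithmetic above can be checked with a few lines of C. This is a rough 
sketch: the function names are illustrative only, and 2.2 Mbytes is assumed to 
mean 2.2e6 bytes.

```c
/* Average number of bytes per line in the ASCII source file. */
double average_line_length(double file_bytes, double line_count)
{
    return file_bytes / line_count;
}

/* Extra space, in percent, taken by a fixed line width compared with the
 * average line length. */
double padding_overhead_percent(double fixed_width, double average_length)
{
    return (fixed_width / average_length - 1.0) * 100.0;
}
```

With these, average_line_length(2.2e6, 40000) gives 55 and 
padding_overhead_percent(87, 55) gives roughly 58.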

Is it possible that your ASCII source data has varying length lines?  Netcdf 
character arrays require fixed length lines, unless you add the complication of 
line delimiters or index arrays.

Did you or some program choose 87 as the MAXIMUM line length?  If so, then the 
majority of the discrepancy is the storage of extra padding characters (blanks 
or nulls) to pad each data line out to 87 characters.
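
To illustrate the padding in question, here is a minimal sketch of how a 
variable-length text line might be copied into a fixed-width slot before being 
written. The name pad_line, the choice of blanks rather than nulls, and the 
truncation of over-long lines are all assumptions for illustration, not a 
description of the actual conversion program.

```c
#include <string.h>

#define LINE_SYMBOLS 87  /* fixed line width from the ncdump in this thread */

/* Copy one text line into a fixed-width slot, dropping any trailing line
 * ending, and fill the rest of the slot with blanks. Lines longer than the
 * slot are truncated. The slot is not NUL-terminated. Returns the number of
 * padding bytes that were added. */
size_t pad_line(const char *line, char slot[LINE_SYMBOLS])
{
    size_t len = strcspn(line, "\r\n");  /* length without the line ending */

    if (len > LINE_SYMBOLS)
        len = LINE_SYMBOLS;
    memcpy(slot, line, len);
    memset(slot + len, ' ', LINE_SYMBOLS - len);
    return LINE_SYMBOLS - len;
}
```

Summing the return values over all lines gives the number of bytes in the 
uncompressed array that are pure padding.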

The simplest solution for this is probably to just go ahead and enable Netcdf-4 
compression.  Compression of long runs of padding characters should be 
particularly efficient in Netcdf-4.  This could be combined with Chris Barker's 
suggestion to experiment with chunk sizes.

--Dave

On Tue, May 20, 2014 at 10:46 AM, Timothy Stitt <Timothy.Stitt.9@xxxxxx> wrote:
>
> Thanks for the reply Rob.
>
> You were correct in suspecting I was using the classic format, which I was
> able to identify with your ncdump command. I then checked how to use the
> NETCDF-4 format instead and made the change to my write routine. I've now
> got my NC file in NETCDF-4 format but I'm still seeing the 2X file storage
> increase compared to my original ASCII file. Can you see any other
> problems with my file structure based on the ncdump command below?
>
> netcdf plate {
> dimensions:
>         Record_Lines = 4 ;
>         Line_Symbols = 87 ;
>         Record_Number = UNLIMITED ; // (11474 currently)
> variables:
>         char Record(Record_Number, Record_Lines, Line_Symbols) ;
>                 Record:_Storage = "chunked" ;
>                 Record:_ChunkSizes = 1, 4, 87 ;
>
> // global attributes:
>                 :_Format = "netCDF-4" ;
> }
>
> The file sizes are as follows:
>
> 2.2M May 13 16:03 plate.10000 (original ASCII file with 4*10000 lines -
> 10000 records, 4 lines per record)
> 4.5M May 20 12:38 plate.nc
>
> Thanks in advance for your help,
>
> Tim.
>
>
> On 5/20/14, 11:43 AM, "Rob Latham" <robl@xxxxxxxxxxx> wrote:
> >
> >On 05/19/2014 09:52 AM, Timothy Stitt wrote:
> >> Hi all,
> >>
> >> I've been trying to convert a large (40GB) ASCII text file (composed of
> >> multiple records of 4 line ASCII strings about 90 characters long) into
> >> NetCDF format. My plan was to rewrite the original serial code to use
> >> parallel NetCDF to have many MPI processes concurrently read records and
> >> process them in parallel.
> >>
> >> I was able to write some code to convert the ASCII records into
> >> [unlimited][4][90] NetCDF NC_CHAR arrays, which I was able to read
> >> concurrently via parallel NetCDF routines. My question is related to the
> >> size of the converted NetCDF file.
> >>
> >> I notice that the converted NetCDF file is always double the size of the
> >> ASCII file whereas I was hoping for it to be much reduced. I was
> >> therefore wondering if this is expected or is more due to my bad
> >> representation in NetCDF of the ASCII records? I am using
> >> nc_put_vara_text() to write my records. Maybe I need to introduce
> >> compression that I'm not doing already?
> >
> >Are you using the classic file format or the NetCDF-4 file format?
> >
> >Can you provide an ncdump -h of the new file?
> >
> >==rob
> >
> >>
> >> Thanks in advance for any advice you can provide.
> >>
> >> Regards,
> >>
> >> Tim.
