> PACKING is a particular kind of compression that reduces the number of
> bits per symbol by relaxing the requirement that symbols be aligned with
> storage unit boundaries. An example would be packing 3 10-bit AVHRR
> samples into 1 32-bit word, as opposed to using 3 16-bit words. Packing
> may be either lossless or lossy, although when used without
> qualification it usually means lossless packing (I can't imagine a
> situation where lossy packing would be the most appropriate compression
> scheme.)

I wouldn't go so far as to say lossy packing is never appropriate. The
GRIB format's packing is an example of "lossy packing" that is very
appropriate. I have to assume that, since this is the netCDF group, we're
talking about scientific data, i.e., floating point numbers. IEEE floating
point, for example, is not exact, yet we represent numbers with it all the
time.

IEEE floating point actually stores two separate values, a mantissa and an
exponent, which are combined to represent a floating point number using
the following formula:

    F = m * 2 ** e

In single precision the mantissa is 23 bits, the exponent is 8 bits, and
there is a sign bit. For simplicity I'm not going to talk about the hidden
bit. Another way to look at this is that there are (2**23) possible values
for the mantissa, (2**8) for the exponent, and (2**1) for the sign bit.
This yields (2**23)*(2**8)*(2**1) total *discrete* values that IEEE (minus
the hidden bit) can represent, spread over the widest possible range of
values. The reason I'm discussing this is to make the following point: any
range of numbers can be represented by a specific discrete set of integers
to within a certain number of significant digits.

Now consider the following example of a GRIB-like packing. What GRIB does
is take advantage of the fact that a specific grid of values will not span
the entire range of floating point values; in fact the max and min of the
data will differ by at most a few orders of magnitude. Let's say we want
to pack a floating point field into 8-bit values. This will allow us to
represent the data to 2-3 significant digits in base ten, which can be
enough for graphics or simple analysis without a problem.

First find the min and max:

    var_min = min(var)
    var_max = max(var)

Second, subtract the min from the entire set:

    var_normalized = var - var_min
    add_offset = var_min

The max value of var_normalized now defines the scale:

    max_normalized = max(var_normalized)

Now the next step is the complicated one. We need to scale var_normalized
into values in the range 0 <= x < (2**8). This is done with the following
operation:

    scale_factor = (2**7)/(2**ceil(log(max_normalized)/log(2.0)))

The quantity log(max_normalized)/log(2.0) is just the base-2 logarithm of
max_normalized; taking its ceiling gives the maximum order of magnitude of
the data in base two. The ratio 2**7/(2**...), when multiplied by
var_normalized, yields values in the range 0 <= x < (2**8):

    output_values = var_normalized * scale_factor

Now you just have to coerce the output_values array to an integer type and
mask off all but the low-order 8 bits. In the output file you write the
bytes and add the attributes add_offset and scale_factor. To unpack the
data you need only multiply by 1.0/scale_factor and add the add_offset.

This is a lossy packing algorithm, but it can be very useful. How many
significant digits do you need to make a contour plot of data or do
elementary analysis (averages, standard deviations, differences, etc.)?
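To make the steps above concrete, here is a minimal sketch of the packing
and unpacking in Python/NumPy. It implements only the scheme described in
this message, not actual GRIB encoding; the function names, the 8-bit
target, and the sample field values are just illustrative assumptions.

    import numpy as np

    def pack_8bit(var):
        # Lossy GRIB-like packing into 8-bit integers, as described above.
        var = np.asarray(var, dtype=np.float64)

        # Step 1: the min of the field becomes the additive offset.
        add_offset = var.min()
        var_normalized = var - add_offset

        # Step 2: scale so the packed values fall in 0 <= x < 2**8.
        max_normalized = var_normalized.max()
        scale_factor = (2.0 ** 7) / (
            2.0 ** np.ceil(np.log(max_normalized) / np.log(2.0)))

        # Step 3: coerce to integers; each value now fits in one byte.
        packed = (var_normalized * scale_factor).astype(np.uint8)
        return packed, add_offset, scale_factor

    def unpack_8bit(packed, add_offset, scale_factor):
        # Invert the packing: multiply by 1/scale_factor, add the offset.
        return packed * (1.0 / scale_factor) + add_offset

    # Illustrative field spanning only a narrow range of values.
    field = np.array([271.3, 275.9, 283.4, 290.1, 301.7])
    packed, off, scale = pack_8bit(field)
    restored = unpack_8bit(packed, off, scale)
    # restored agrees with field to roughly 2-3 significant digits.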
Some data is made by objective analysis of irregular data, like station
observations. I doubt such data is ever valid to more than 2 significant
digits, so why store 7?

-ethan

Then you subtract the min value from each element of the array, divide by
the scale of the valid range, coerce the data to integers, and mask off
the number of bits you want. This gives you the ability to unpack the data
by multiplying the integer component by a scale_factor and adding an
additive offset, a straightforward linear-time operation (a short netCDF
sketch of this attribute convention follows the quoted signature below).

> Regards,
> /Frew
>
> #======================================================================
> # James Frew   frew@xxxxxxxxxxxxx   http://www.bren.ucsb.edu/~frew/
> # School of Environmental Science and Management   +1.805.893.7356 vox
> # University of California, Santa Barbara, CA 93106-5131     .7612 fax
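For completeness, here is a hedged sketch of how the packed bytes and the
scale_factor/add_offset attributes might be written and read back with the
netCDF4-python module. The variable name "temp", the file name, and the
reuse of pack_8bit() and field from the earlier sketch are assumptions for
illustration only. Note that under the usual netCDF attribute convention a
reader computes unpacked = packed * scale_factor + add_offset, so the
stored scale_factor is the reciprocal of the packing multiplier computed
above.

    from netCDF4 import Dataset
    import numpy as np

    packed, off, scale = pack_8bit(field)

    ds = Dataset("packed.nc", "w")
    ds.createDimension("x", len(field))
    v = ds.createVariable("temp", "u1", ("x",))

    # Store the unpacking multiplier and offset as attributes.
    v.scale_factor = 1.0 / scale
    v.add_offset = off

    # Write the raw bytes; disable netCDF4-python's automatic scaling so
    # it does not try to pack the already-packed values a second time.
    v.set_auto_scale(False)
    v[:] = packed
    ds.close()

    # A reader with auto-scaling enabled (the default) gets floats back:
    with Dataset("packed.nc") as ds:
        restored = ds.variables["temp"][:]   # unpacked to floating point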