Re: when will HDF5 support Unicode?

To: netcdf-hdf@xxxxxxxxxxxxxxxx
Subject: Re: when will HDF5 support Unicode?
From: "Robert E. McGrath" <mcgrath@xxxxxxxxxxxxx>
Date: Fri, 6 May 2005 12:31:50 -0500

Russ,

Here is the current plan to provide limited support for Unicode for HDF5-1.8.0.


Specifically:

1. one new character encoding, UTF-8, will be added for user datatypes,
    i.e., datasets with string data. This is a straightforward extension of
    the current string data types.

2. a new property when creating links (i.e., creating objects or adding
    to groups), to specify either ASCII or UTF-8.
   - default will be 'ASCII' (for backward compatibility)
   - query will tell the encoding, one of (UNKNOWN, ASCII, UTF-8). Older
     files will return UNKNOWN
   - the link names will not be checked, i.e., we won't check that it
     is legal UTF-8.
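
Since the library will not validate link names, an application that wants well-formed names would have to check them itself before creating the link. A minimal sketch of such a check, using only the Python standard library (the helper name `is_valid_utf8` is hypothetical and not part of any HDF5 API):

```python
def is_valid_utf8(name: bytes) -> bool:
    """Return True if `name` is a well-formed UTF-8 byte sequence."""
    try:
        name.decode("utf-8")  # strict error handling is the default
        return True
    except UnicodeDecodeError:
        return False


# A name that was encoded as UTF-8 passes the check.
print(is_valid_utf8("température".encode("utf-8")))  # True

# The same name encoded as Latin-1 contains a lone 0xE9 byte and fails.
print(is_valid_utf8(b"temp\xe9rature"))  # False
```

Doing the check on the caller's side keeps the library's link-creation path unchanged, which matches the "names will not be checked" plan above.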

Other unicode support will be considered at a later date.

On 2005.05.05 16:17 Russ Rew wrote:

> We've had several discussions of UTF-8 support.  The current ideas are
> incorporated in an RFC at:
>
>    http://hdf.ncsa.uiuc.edu/RFC/Unicode/Unicode.html
>
> Close reading of this RFC will indicate that we know how to support
> UTF-8 for user data, but support for UTF-8 for names is still TBD.

I would consider supporting only UTF-8 for names but permit users to
specify other encodings as well for user data, for two reasons:

 - fixed-width encodings (like UCS2) permit quick access to the nth
   character in a string

 - other encodings may permit more compact representation than UTF-8
   for strings that contain a lot of non-ASCII characters
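
Both points can be illustrated with a short standard-library sketch. The sample string is an arbitrary choice, and UTF-16-LE stands in for UCS-2 here, which is only equivalent for text that stays inside the Basic Multilingual Plane; the helper `nth_char_fixed_width` is hypothetical.

```python
text = "日本語のテキスト"  # 8 characters, all in the Basic Multilingual Plane

utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-le")  # no BOM; same bytes as UCS-2 for BMP text

# Each of these characters takes 3 bytes in UTF-8 but 2 bytes in UTF-16.
print(len(utf8), len(utf16))  # 24 16


def nth_char_fixed_width(data: bytes, n: int) -> str:
    """Fetch character n from 2-byte fixed-width data by direct offset."""
    return data[2 * n : 2 * n + 2].decode("utf-16-le")


# Constant-time lookup: no need to scan the preceding characters.
print(nth_char_fixed_width(utf16, 3))  # の
```

With UTF-8, finding the nth character generally means scanning from the start of the string, because characters occupy between one and four bytes.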

Joel Spolsky's column is a good introduction to some Unicode issues,
but I recommend this article for developers:

  http://www.w3.org/TR/charmod/

For example, the above gives examples of some of the complications in
sorting datasets alphabetically in a Group if you support Unicode
names.  You might need to use the "Unicode Collation Algorithm" in
that case.  Fortunately, there are open source implementations for
such things in ICU (International Components for Unicode):

  http://icu.sourceforge.net/
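
To make the sorting problem concrete, here is a small standard-library sketch. The dataset names are made up, and `crude_key` is a hypothetical helper that only strips accents and case; it is not the Unicode Collation Algorithm and is no substitute for ICU.

```python
import unicodedata

names = ["zebra", "Zebra", "apple", "Éclair"]

# Plain sorted() compares code points, so uppercase sorts before
# lowercase and accented letters sort after all of ASCII.
print(sorted(names))  # ['Zebra', 'apple', 'zebra', 'Éclair']


def crude_key(s: str) -> str:
    """Rough approximation of alphabetical order: drop accents and case."""
    decomposed = unicodedata.normalize("NFD", s)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.casefold()


print(sorted(names, key=crude_key))  # ['apple', 'Éclair', 'zebra', 'Zebra']
```

Real collation is locale-dependent and handles far more cases than this, which is why a library such as ICU would be the safer route.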

--Russ
