Hi Russ,
> I'd like to reconsider the Unicode issue, and specifically ask about
> the feasibility of what we hope is a small addition to HDF5 to allow
> netCDF to support UTF-8 encoded names for variables, dimensions, and
> attributes without HDF5 having to support such encoded names.
>
> We would like to just declare in netCDF documentation that the names
> for netCDF variables, dimensions, and attributes are UTF-8 encoded
> when provided to or returned from netCDF interfaces. This is
> backwards compatible, because we currently only support ASCII strings
> (with some restrictions), and what we're proposing would just remove
> the restrictions and allow non-ASCII bytes (with the upper bit set),
> to allow for UTF-8 encoding of other Unicode characters.
>
> What we would need from HDF5 is a way to request that names for
> Datasets and Attributes allow an arbitrary byte array, so we can use
> UTF-8 encoding for non-ASCII characters.
>
> Is this feasible?
After rooting through the group API as much as I have recently, I think
it's probably quite feasible for the names of objects & attributes to use UTF-8
encoding for their strings. There are only two hangups I can see:
- The names will be sorted in byte-value order, since there's no locale
information embedded in the file, which may disconcert international
users.
- The strings are nul-terminated and I'm not certain if part of a UTF-8
string can be nul.
I'll write some tests that check for proper insertion of non-ASCII strings
as object & attribute names and let you know what I find out.
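
In the meantime, both hangups can be explored outside the library. What
follows is only a rough sketch in plain Python, not HDF5 code and not the tests
mentioned above: the helper names (utf8_has_nul, byte_order) and the sample
names are made up for illustration. It encodes a few names as UTF-8, checks
for zero bytes, and sorts them by byte value.

```python
# Sketch only: plain Python, no HDF5 calls. Helper and sample names are hypothetical.

def utf8_has_nul(name: str) -> bool:
    """Return True if the UTF-8 encoding of name contains a zero byte."""
    return b"\x00" in name.encode("utf-8")


def byte_order(names):
    """Sort names by the byte values of their UTF-8 encodings."""
    return sorted(names, key=lambda s: s.encode("utf-8"))


# Sample names: two ASCII, one Latin-1 range (e-acute), one CJK.
names = ["zeta", "\u00e9tat", "Temperatur", "\u6e29\u5ea6"]

# Which sample names would be cut short by a nul-terminated C string?
print([utf8_has_nul(n) for n in names])

# A name that contains U+0000 itself does produce a zero byte.
print(utf8_has_nul("a\x00b"))

# Byte-value order: ASCII names first, then the multi-byte ones.
print([n.encode("utf-8") for n in byte_order(names)])
```

As far as the encoding itself goes, every byte of a multi-byte UTF-8 sequence
has the upper bit set, so a zero byte should only appear when the name contains
the nul character (U+0000) itself. Byte-value order puts the accented name after
every ASCII name rather than next to "e", which is the ordering a locale-aware
user might not expect.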
Note that Unicode strings as elements of a dataset are harder and probably
won't work correctly at present.
Quincey
> Otherwise there are no library changes in netCDF that we would need to
> support UTF-8 encoding for Unicode names. Some applications such as
> ncdump and ncgen will have to know how to handle encoded names, but we
> are willing to deal with that.
>
> Note that we're not requesting that you drop restrictions on all
> names, just that you provide a way for netCDF-4 to be able to use
> names with non-ASCII bytes, for example a call to a function that says
> checking on new names will subsequently be lenient (e.g. you could still
> disallow empty names, names with embedded null characters, or names
> that are too long). Existing code that didn't invoke this call would
> still have to abide by the current name restrictions.
>
> Also I notice that the documentation for H5Acreate and H5Dcreate at
>
> http://hdf.ncsa.uiuc.edu/HDF5/doc/RM_H5A.html#Annot-Create
> http://hdf.ncsa.uiuc.edu/HDF5/doc/RM_H5D.html#Dataset-Create
>
> currently lists no restriction limiting names to ASCII characters,
> but the Introduction to HDF5 says
>
> A dataset name is a sequence of alphanumeric ASCII characters.
>
> --Russ
>