NOTE: The netcdf-hdf mailing list is no longer active. The list archives are made available for historical reasons.
Hi Russ,

> I'd like to reconsider the Unicode issue, and specifically ask about
> the feasibility of what we hope is a small addition to HDF5 to allow
> netCDF to support UTF-8 encoded names for variables, dimensions, and
> attributes without HDF5 having to support such encoded names.
>
> We would like to just declare in netCDF documentation that the names
> for netCDF variables, dimensions, and attributes are UTF-8 encoded
> when provided to or returned from netCDF interfaces.  This is
> backwards compatible, because we currently only support ASCII strings
> (with some restrictions), and what we're proposing would just remove
> the restrictions and allow non-ASCII bytes (with the upper bit set),
> to allow for UTF-8 encoding of other Unicode characters.
>
> What we would need from HDF5 is a way to request that names for
> Datasets and Attributes allow an arbitrary byte array, so we can use
> UTF-8 encoding for non-ASCII characters.
>
> Is this feasible?

After rooting through the group API as much as I have recently, I think
it's probably quite feasible for the names of objects & attributes to use
UTF-8 encoding for their strings.  There are only two hangups I can see:

- The names will be sorted in byte-value order, since there's no
  locale information embedded in the file, which may disconcert
  international users.
- The strings are nul-terminated and I'm not certain if part of a
  UTF-8 string can be nul.

I'll write some tests that check for proper insertion of non-ASCII
strings as object & attribute names and let you know what I find out.

Note that Unicode strings as elements of a dataset are harder and
probably won't work correctly currently.

    Quincey

> Otherwise there are no library changes in netCDF that we would need to
> support UTF-8 encoding for Unicode names.  Some applications such as
> ncdump and ncgen will have to know how to handle encoded names, but we
> are willing to deal with that.
>
> Note that we're not requesting that you drop restrictions on all
> names, just that you provide a way for netCDF-4 to be able to use
> names with non-ASCII bytes, for example a call to a function that says
> checking on new names will subsequently be lenient (e.g. you could
> still disallow empty names, names with embedded null characters, or
> names that are too long).  Existing code that didn't invoke this call
> would still have to abide by the current name restrictions.
>
> Also I notice that the documentation for H5Acreate and H5Dcreate at
>
>   http://hdf.ncsa.uiuc.edu/HDF5/doc/RM_H5A.html#Annot-Create
>   http://hdf.ncsa.uiuc.edu/HDF5/doc/RM_H5D.html#Dataset-Create
>
> currently lists no restrictions on names to use only ASCII characters,
> but the Introduction to HDF5 says
>
>   A dataset name is a sequence of alphanumeric ASCII characters.
>
> --Russ
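
On the two hangups raised in the reply: UTF-8 encodes every code point
other than U+0000 using only nonzero bytes (multi-byte sequences have
the high bit set in every byte), so a nul-terminated name cannot be cut
short partway through a character.  The byte-value ordering concern does
hold: strcmp-style comparison puts every non-ASCII character after all
ASCII ones, whatever a locale would prefer.  A minimal C sketch of both
points follows.  The sample name and the helper names
(sample_name, has_no_zero_byte, has_non_ascii_byte, byte_order_cmp,
demo) are hypothetical illustrations, not part of HDF5 or netCDF.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical variable name "température".
 * The letter é (U+00E9) is the two UTF-8 bytes C3 A9.
 * Adjacent literals keep the hex escapes from absorbing later letters. */
const char sample_name[] = "temp" "\xC3\xA9" "rature";

/* Return 1 if none of the first n bytes of s is zero, otherwise 0. */
int has_no_zero_byte(const char *s, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (s[i] == '\0') {
            return 0;
        }
    }
    return 1;
}

/* Return 1 if the nul-terminated string s has a byte with the high bit set. */
int has_non_ascii_byte(const char *s)
{
    for (; *s != '\0'; s++) {
        if ((unsigned char)*s & 0x80u) {
            return 1;
        }
    }
    return 0;
}

/* Compare two names in byte-value order (strcmp compares bytes as
 * unsigned char); the result is normalized to -1, 0, or 1. */
int byte_order_cmp(const char *a, const char *b)
{
    int r = strcmp(a, b);
    return (r > 0) - (r < 0);
}

/* Exercise the helpers on the sample name. */
void demo(void)
{
    size_t n = sizeof sample_name - 1;   /* byte length without terminator */

    assert(strlen(sample_name) == n);           /* no early terminator */
    assert(has_no_zero_byte(sample_name, n));   /* no zero byte inside  */
    assert(has_non_ascii_byte(sample_name));    /* the name is not ASCII */

    /* In byte order "é" sorts after "z", not between "e" and "f". */
    assert(byte_order_cmp("z", "\xC3\xA9") < 0);
    assert(byte_order_cmp("\xC3\xA9", "f") > 0);
}
```

Calling demo() from a main function runs the checks.  This only looks at
the raw bytes of a name; whether a given HDF5 release accepts such names
still has to be tested against the library itself.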