Re: [netcdfgroup] Concurrent writes to netcdf3, what goes wrong?

Your code fragment looks fine.

You can give PnetCDF a try to see if there is any difference.
In general, independent I/O mode performs worse than collective.
This is because the underlying file systems cope poorly with the
noncontiguous file accesses that result from independent mode.

If you can find a way to write a whole variable at a time,
using collective mode will significantly improve the performance.
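
For example, keeping your per-variable decomposition, one way to make the
writes collective is to have every rank join every call, with zero-sized
counts on the ranks that do not own the variable. A rough, untested sketch,
reusing the names from your fragment below:

size_t zeros[NC_MAX_VAR_DIMS] = {0};   /* zero start/count for non-owners */
for (size_t vi = 0; vi < vars.size(); ++vi) {
  check(nc_var_par_access(ncId, varId[vi], NC_COLLECTIVE));
  bool mine = (vi % mifi_mpi_size) == mifi_mpi_rank;
  if (mine) {
      /* ... read data for the variable from grib ... */
  }
  /* in collective mode every rank must make the call; non-owners
     contribute an empty region and their buffer contents are ignored */
  check(nc_put_vara_float(ncId, varId[vi],
                          mine ? start : zeros,
                          mine ? count : zeros,
                          data));
}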

Wei-keng

On Sep 22, 2015, at 9:43 AM, Heiko Klein wrote:

> Hi Wei-keng,
> 
> thanks for the information. I now have the parallel version working with
> NC_INDEPENDENT, and with 2 processors I see some benefit, e.g. a reduction
> in wall-clock time from 40s to 30s. But when I add more processors,
> performance gets worse.
> 
> I'm using parallel HDF5 with an uncompressed netCDF-4 file. Is
> NC_INDEPENDENT faster with netCDF-3 via PnetCDF?
> 
> 
> What I'm basically doing now is:
> 
> // round-robin decomposition: each rank takes every mifi_mpi_size-th variable
> for (size_t vi = 0; vi < vars.size(); ++vi) {
>   if ((vi % mifi_mpi_size) != mifi_mpi_rank) {
>       continue; // variable belongs to another rank
>   }
>   // switch this variable to independent parallel access
>   check(nc_var_par_access(ncId, varId[vi], NC_INDEPENDENT));
>   // ... read data for the variable from grib ...
>   check(nc_put_vara_float(ncId, varId[vi], start, count, data));
> }
> 
> Maybe there are better ways to do that?
> 
> 
> Best regards,
> 
> Heiko
> 
> 
> On 2015-09-22 09:07, Wei-keng Liao wrote:
>> Hi, Heiko
>> 
>> In that case, you can use independent mode.
>> I.e. nc_var_par_access(ncid, varid, NC_INDEPENDENT)
>> 
>> It still allows you to write to a shared file from multiple
>> MPI processes independently, at different times.
>> 
>> However, the performance will not be as good as in collective mode.
>> 
>> Wei-keng
>> 
>> On Sep 22, 2015, at 1:45 AM, Heiko Klein wrote:
>> 
>>> Hi Wei-keng,
>>> 
>>> thanks for your tip about using PnetCDF. I've worked with MPI, but only
>>> for modeling, i.e. when all processes do approximately the same thing at
>>> the same time.
>>> 
>>> The problem here is that the 10 input files don't appear on my machines
>>> at the same time. They are ensemble members downloaded from different
>>> machines with different processors, so the first file might appear 30s
>>> before the last one (within a total time of 2 minutes per time-step). I
>>> would like to start as soon as the first file appears, but that sounds
>>> very difficult with MPI, doesn't it? (I'm more familiar with OpenMP,
>>> which offers task-based parallelization (what I would use here) and
>>> loop-based parallelization (which is more like MPI?))
>>> 
>>> Best regards,
>>> 
>>> Heiko
>>> 
>>> On 2015-09-22 03:24, Wei-keng Liao wrote:
>>>> Hi, Heiko
>>>> 
>>>> Parallel I/O to the classic netCDF format is supported by netCDF through
>>>> PnetCDF underneath.
>>>> It allows you to write concurrently to a single shared file from multiple
>>>> MPI processes.
>>>> Of course, you will have to build PnetCDF first and then build netCDF with
>>>> the --enable-pnetcdf configure option.
>>>> 
>>>> Your netCDF program does not need many changes to make use of this feature.
>>>> All you have to do is the following (a short sketch comes after the list):
>>>> 1. call nc_create_par() instead of nc_create()
>>>> 2. add NC_PNETCDF to the create mode argument of nc_create_par
>>>> 3. call nc_var_par_access(ncid, varid, NC_COLLECTIVE) after nc_enddef to 
>>>> enable collective I/O mode
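>>>> 
>>>> Roughly, the three steps above look like this (a bare-bones, untested
>>>> sketch; the file name, communicator, and info object are placeholders,
>>>> and error checking is omitted):
>>>> 
>>>>   #include <mpi.h>
>>>>   #include <netcdf.h>
>>>>   #include <netcdf_par.h>
>>>> 
>>>>   int ncid, varid;
>>>>   /* steps 1+2: parallel create with NC_PNETCDF in the create mode */
>>>>   nc_create_par("out.nc", NC_CLOBBER | NC_PNETCDF,
>>>>                 MPI_COMM_WORLD, MPI_INFO_NULL, &ncid);
>>>>   /* ... define dimensions and variables (this sets varid) ... */
>>>>   nc_enddef(ncid);
>>>>   /* step 3: switch the variable to collective access after nc_enddef */
>>>>   nc_var_par_access(ncid, varid, NC_COLLECTIVE);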
>>>> 
>>>> There are a couple of example codes available at this URL:
>>>> http://cucis.ece.northwestern.edu/projects/PnetCDF/#InteroperabilityWithNetCDF4
>>>> 
>>>> There are instructions in each example file for building netCDF with 
>>>> PnetCDF.
>>>> For downloading PnetCDF, please see 
>>>> http://cucis.ece.northwestern.edu/projects/PnetCDF/download.html
>>>> 
>>>> Wei-keng
>>>> 
>>>> On Sep 21, 2015, at 9:14 AM, Heiko Klein wrote:
>>>> 
>>>>> Hi Nick,
>>>>> 
>>>>> yes, they are all writing to the same file - we want to have one file at
>>>>> the end.
>>>>> 
>>>>> I've been scanning through the source code of netCDF-3. I guess the
>>>>> problem of the partly written sections is caused by the translation of
>>>>> the nc_put_vara calls to internal pages, and then from the internal pages
>>>>> to disk. And possibly the internal pages are not aligned with my
>>>>> nc_put_vara calls, so even when the regions of the nc_put_vara calls don't
>>>>> overlap between concurrent calls, the internal pages do? Is there a way
>>>>> to enforce proper alignment? I see nc__enddef has several alignment
>>>>> parameters.
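>>>>> 
>>>>> For example, would something along these lines be the right way to use
>>>>> them (just a guess, untested; 4096 picked as a typical page size)?
>>>>> 
>>>>>   /* align the start of the data section and of the record section to 4 KiB */
>>>>>   nc__enddef(ncid, 0, 4096, 0, 4096);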
>>>>> 
>>>>> 
>>>>> I'm aware that concurrent writes are not officially supported by the
>>>>> netCDF library. But IT infrastructure has changed a lot since the start
>>>>> of the netCDF library, and systems are nowadays highly parallelized, both
>>>>> on the CPU side and in I/O and filesystems. I'm trying to find a way to
>>>>> allow for simple parallelization. Having many output files from a model
>>>>> is risky for data consistency, so I would like to avoid it without
>>>>> sacrificing too much speed.
>>>>> 
>>>>> Best regards,
>>>>> 
>>>>> Heiko
>>>>> 
>>>>> 
>>>>> On 2015-09-21 15:18, Nick Papior wrote:
>>>>>> So, are they writing to the same files?
>>>>>> 
>>>>>> I.e. job1 writes a(:,1) to test.nc and job2 writes a(:,2) to test.nc?
>>>>>> Because that is not allowed.
>>>>>> 
>>>>>> 2015-09-21 15:13 GMT+02:00 Heiko Klein <Heiko.Klein@xxxxxx>:
>>>>>> 
>>>>>>  Hi,
>>>>>> 
>>>>>>  I'm trying to convert about 90GB of NWP data 4 times daily from grib to
>>>>>>  netCDF. The grib files arrive as fast as the data can be downloaded from
>>>>>>  the HPC machines. They come as 10 files per forecast timestep.
>>>>>> 
>>>>>>  Currently, I manage to convert 1 file per forecast timestep, and I would
>>>>>>  like to parallelize the conversion into independent jobs (i.e. neither MPI
>>>>>>  nor OpenMP), with a theoretical performance increase of 10x. The underlying
>>>>>>  I/O system is fast enough to handle 10 jobs, and I have enough CPUs, but
>>>>>>  the concurrently written netCDF files contain data that is only half
>>>>>>  written to disk, or mixed with other slices.
>>>>>> 
>>>>>>  What I do is create a _FillValue 'template' file, containing all
>>>>>>  definitions, before the NWP job runs. When a new set of files arrives,
>>>>>>  the data is written to the respective data slices, which don't overlap;
>>>>>>  there is never a redefine, only functions like nc_put_vara_*
>>>>>>  with different slices.
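>>>>>> 
>>>>>>  In outline, the setup looks roughly like this (a simplified, untested
>>>>>>  sketch; the dimension and variable names are made up):
>>>>>> 
>>>>>>  /* before the NWP job: create the template with all definitions */
>>>>>>  int ncid, tdim, xdim, vid, dims[2];
>>>>>>  float fill = -9.9e33f;
>>>>>>  nc_create("template.nc", NC_CLOBBER | NC_64BIT_OFFSET, &ncid);
>>>>>>  nc_def_dim(ncid, "time", 66, &tdim);
>>>>>>  nc_def_dim(ncid, "x", 1000, &xdim);
>>>>>>  dims[0] = tdim; dims[1] = xdim;
>>>>>>  nc_def_var(ncid, "temperature", NC_FLOAT, 2, dims, &vid);
>>>>>>  nc_put_att_float(ncid, vid, "_FillValue", NC_FLOAT, 1, &fill);
>>>>>>  nc_enddef(ncid);
>>>>>>  nc_close(ncid);
>>>>>> 
>>>>>>  /* later, each independent conversion job: */
>>>>>>  nc_open("template.nc", NC_WRITE, &ncid);
>>>>>>  /* ... nc_put_vara_float on this job's non-overlapping slice ... */
>>>>>>  nc_close(ncid);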
>>>>>> 
>>>>>>  Since the nc_put_vara_* calls are non-overlapping, I hoped that this
>>>>>>  type of concurrent write would work - but it doesn't. Is it really such a
>>>>>>  bad idea to write data in parallel (e.g. are there internal buffers that
>>>>>>  get rewritten)? Any ideas on how to improve the conversion process?
>>>>>> 
>>>>>>  Best regards,
>>>>>> 
>>>>>>  Heiko
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> -- 
>>>>>> Kind regards Nick
>>>>> 
>>>>> 
>>>> 
>>> 
>> 
> 
> -- 
> Dr. Heiko Klein                   Norwegian Meteorological Institute
> Tel. + 47 22 96 32 58             P.O. Box 43 Blindern
> http://www.met.no                 0313 Oslo NORWAY


