I still think it's the same issue as ours, since we get file corruption problems with the *sao.gem files even when dcmetr isn't using all of the CPU. I know they are corrupted when users report GARP crashes when trying to load them. I know this isn't OS specific, as you are running Linux and we are running Solaris x86. I upgraded to GEMPAK 5.7.2p2 and had the same problem with dcmetr. I am running ldm-6.0.14.

-----Original Message-----
From: Arthur A. Person [mailto:person@xxxxxxxxxxxxx]
Sent: Wednesday, May 26, 2004 12:43 PM
To: Robert Mullenax
Cc: ldm-users@xxxxxxxxxxxxxxxx
Subject: RE: Strange ldm/gempak behaviour

Hi...

It looks like the LDM stopping issue is due to rtstats hanging, a problem Unidata is working on right now.

As for dcmetr, I actually have it running on a second system because I've been having file corruption problems on my first system. In today's case, dcmetr seemed hosed on the first system but okay on the second system. I say this just to indicate that I too have seen dcmetr file corruption problems, but I'm still trying to figure out what the root cause might be. I don't think I see it using a lot of CPU, however, nor do I ever have to pkill it.

Art.

On Wed, 26 May 2004, Robert Mullenax wrote:

> Yes, I have been having very similar problems with dcmetr (it would hang
> and use all the CPU and produce corrupted .gem files). I reported it a
> week or so ago, but never got any responses from anyone having issues.
>
> I have to do a pkill -9 dcmetr
>
> -----Original Message-----
> From: owner-ldm-users@xxxxxxxxxxxxxxxx
> [mailto:owner-ldm-users@xxxxxxxxxxxxxxxx] On Behalf Of Arthur A. Person
> Sent: Wednesday, May 26, 2004 11:22 AM
> To: ldm-users@xxxxxxxxxxxxxxxx
> Subject: Strange ldm/gempak behaviour
>
> Hi...
>
> Thought I'd throw this out for comments... I just fixed (I think) a
> strange LDM/Gempak problem: dcmetr was core dumping many times/minute,
> yesterday's *sao.gem file was at the 4GB limit (actually larger:
> 4488229376 bytes...???) but today's was ~4.5M thus far. I figured I would
> stop/restart the LDM, but when I tried to stop it, one rpc and rtstats
> wouldn't go down, so I had to kill them, remake the queues, and then
> restart. Oddly enough, I have a second LDM running on another system that
> also decodes METARs (whose files seemed okay size-wise) which, when I tried
> to stop its LDM, also hung similarly, and I had to kill/rebuild/restart
> it as well.
>
> Anyone have any similar experience or could suggest a cause? I don't
> recall that I've ever seen anything quite like this before.
>
> Thanks.
>
> Art.
>
> --
> Arthur A. Person
> Research Assistant, System Administrator
> Penn State Department of Meteorology
> email: person@xxxxxxxxxxxxxxxxxx, phone: 814-863-1563
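A couple of quick checks for the symptoms above; this is a sketch only, and the data paths are illustrative rather than taken from either poster's setup:

    # Is the daily surface file at or past the 4GB mark (4294967296 bytes)?
    ls -l /data/gempak/surface/*sao.gem

    # Is more than one dcmetr process running at once?
    ps -ef | grep '[d]cmetr'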
Date: 26 May 2004 13:05:38 -0600
From: Steve Chiswell <chiz@xxxxxxxxxxxxxxxx>
Reply-To: chiz@xxxxxxxx
To: Robert Mullenax <rmullenax@xxxxxxxxxxxx>
Cc: "'Arthur A. Person'" <person@xxxxxxxxxxxxx>, ldm-users@xxxxxxxxxxxxxxxx,
    GEMPAK support <support-gempak@xxxxxxxxxxxxxxxx>
Subject: 20040526: Strange ldm/gempak behaviour

Robert,

As I mentioned to Art back when he raised the problem with corrupt data files: one possibility is that your surface file is getting corrupted by pqact firing up more than one instance of the dcmetr decoder. This would happen if your file I/O backed up to the point where pqact failed to write into the open pipe, or if the decoder slowed to the point that pqact could not push any more data down the open PIPE. The way to see if this is the case is to look in the dcmetr.log files for overlapping process IDs for the same filename (date/time). Since the same stream is apparently working on a second system, the one issue to consider is whether the system loading is the same on both machines.

Also, you can issue a "kill -USR2" twice to the pqact process that is responsible for running decoders such as dcmetr. This will put the pqact process into debug mode and output to your LDM log file information about how long it takes pqact to process each product once it arrives in the local queue. These lines can be found by their pq_sequence "Delay" message. If your pqact process is overloaded, you will see increasing values on the order of hundreds to thousands of seconds; a well-running LDM typically has values of less than 1 second. (A third "kill -USR2" will take you out of debug mode, which is good since debug logging takes a lot of log file space.) If your pqact is falling behind, one symptom would be that data doesn't show up on disk for some time even though you are receiving it in a timely fashion.

In the $NAWIPS/ldm/etc/gen_pqact.csh script, I provide an option for creating separate configuration files to run multiple instances of pqact, which distributes pqact's processing load (helpful in particular if you are filing all NEXRAD or CRAFT data).

As for rtstats, our LDM server that receives these statistics was recently moved to a different network, which might have resulted in the process not exiting on shutdown.
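For example, the two checks described above might look like this. A sketch only: the log locations and the exact dcmetr.log line layout are assumptions, so adjust them to your installation.

    # Look for overlapping PIDs decoding the same date/time file:
    grep 'sao.gem' ~ldm/logs/dcmetr.log | sort | less

    # Step pqact's logging up to debug (each USR2 bumps the level),
    # check the pq_sequence delays, then cycle back to normal:
    kill -USR2 `pgrep pqact`
    kill -USR2 `pgrep pqact`
    grep -i delay ~ldm/logs/ldmd.log | tail   # values under 1 second are healthy
    kill -USR2 `pgrep pqact`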
I will continue to monitor these issues.

Steve Chiswell
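For anyone pursuing the gen_pqact.csh suggestion in Steve's message: the split ends up as multiple pqact entries in ldmd.conf, along these lines. The feedtype expressions and configuration file names below are made up for illustration; the script generates its own.

    # Two pqact instances, each with its own pattern/action file, so a
    # heavy feed (e.g. NEXRAD) doesn't starve the other decoders:
    exec    "pqact -f NNEXRAD etc/pqact.conf.nexrad"
    exec    "pqact -f ANY-NNEXRAD etc/pqact.conf.other"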