I still think it's the same issue as ours, since we get file corruption problems with the *sao.gem files even when dcmetr isn't using all of the CPU. I know they are corrupted when users report GARP crashes when trying to load them. I know this isn't OS specific, as you are running Linux and we are running Solaris x86. I upgraded to GEMPAK 5.7.2p2 and had the same problem with dcmetr. I am running ldm-6.0.14.

-----Original Message-----
From: Arthur A. Person [mailto:person@xxxxxxxxxxxxx]
Sent: Wednesday, May 26, 2004 12:43 PM
To: Robert Mullenax
Cc: ldm-users@xxxxxxxxxxxxxxxx
Subject: RE: Strange ldm/gempak behaviour

Hi...

It looks like the LDM stopping issue is due to rtstats hanging, a problem Unidata is working on right now.

As for dcmetr, I actually have it running on a second system because I've been having file corruption problems on my first system. In today's case, dcmetr seemed hosed on the first system but okay on the second system. I say this just to indicate that I too have seen dcmetr file corruption problems, but I'm still trying to figure out what the root cause might be. I don't think I see it using a lot of CPU, however, nor do I ever have to pkill it.

Art.

On Wed, 26 May 2004, Robert Mullenax wrote:

> Yes, I have been having very similar problems with dcmetr (it would hang
> and use all the CPU and produce corrupted .gem files). I reported it a
> week or so ago, but never got any responses from anyone having issues.
>
> I have to do a pkill -9 dcmetr
>
> -----Original Message-----
> From: owner-ldm-users@xxxxxxxxxxxxxxxx
> [mailto:owner-ldm-users@xxxxxxxxxxxxxxxx] On Behalf Of Arthur A. Person
> Sent: Wednesday, May 26, 2004 11:22 AM
> To: ldm-users@xxxxxxxxxxxxxxxx
> Subject: Strange ldm/gempak behaviour
>
> Hi...
>
> Thought I'd throw this out for comments... I just fixed (I think) a
> strange LDM/Gempak problem: dcmetr was core dumping many times/minute,
> yesterday's *sao.gem file was at the 4GB limit (actually larger:
> 4488229376 bytes...???) but today's was ~4.5M thus far. I figured I would
> stop/restart the LDM, but when I tried to stop it, one rpc and rtstats
> wouldn't go down, so I had to kill them, remake the queues, and then
> restart. Oddly enough, I have a second LDM running on another system that
> also decodes METARs (whose files seemed okay size-wise) which, when I tried
> to stop its LDM, also hung similarly, and I had to kill/rebuild/restart
> it as well.
>
> Anyone have any similar experience or could suggest a cause? I don't
> recall that I've ever seen anything quite like this before.
>
> Thanks.
>
> Art.
>
> --
> Arthur A. Person
> Research Assistant, System Administrator
> Penn State Department of Meteorology
> email: person@xxxxxxxxxxxxxxxxxx, phone: 814-863-1563
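A couple of quick checks for the symptoms above; this is a sketch only, and the data paths are illustrative rather than taken from either poster's setup:

    # Is the daily surface file at or past the 4GB mark (4294967296 bytes)?
    ls -l /data/gempak/surface/*sao.gem

    # Is more than one dcmetr process running at once?
    ps -ef | grep '[d]cmetr'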
Date: 26 May 2004 13:05:38 -0600
From: Steve Chiswell <chiz@xxxxxxxxxxxxxxxx>
Reply-To: chiz@xxxxxxxx
To: Robert Mullenax <rmullenax@xxxxxxxxxxxx>
Cc: "'Arthur A. Person'" <person@xxxxxxxxxxxxx>, ldm-users@xxxxxxxxxxxxxxxx,
    GEMPAK support <support-gempak@xxxxxxxxxxxxxxxx>
Subject: 20040526: Strange ldm/gempak behaviour

Robert,

As I mentioned to Art back when he raised the problem with corrupt data files: one possibility is that your surface file is getting corrupted by pqact firing up more than one instance of the dcmetr decoder. This would happen if your file I/O backed up to the point where pqact failed to write into the open pipe, or if the decoder slowed to the point that pqact could not push any more data down the open PIPE. The way to see if this is the case is to look in the dcmetr.log files for overlapping process IDs for the same filename (date/time). Since the same stream is apparently working on a second system, the one issue to consider is whether the system loading is the same on both machines.

Also, you can issue a "kill -USR2" twice to the pqact process that is responsible for running decoders such as dcmetr. This will put the pqact process into debug mode and output to your LDM log file information about how long it takes pqact to process each product once it arrives in the local queue. These lines can be found by their pq_sequence "Delay" message. If your pqact process is overloaded, you will see increasing values on the order of hundreds to thousands of seconds; a well-running LDM typically has values of less than 1 second. (A third "kill -USR2" will take you out of debug mode, which is good since debug logging takes a lot of log file space.) If your pqact is falling behind, one symptom would be that data doesn't show up on disk for some time even though you are receiving it in a timely fashion.

In the $NAWIPS/ldm/etc/gen_pqact.csh script, I provide an option for creating separate configuration files to run multiple instances of pqact, which distributes pqact's processing load (helpful in particular if you are filing all NEXRAD or CRAFT data).

As for rtstats, our LDM server that receives these statistics was recently moved to a different network, which might have resulted in the process not exiting on shutdown.
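For example, the two checks described above might look like this. A sketch only: the log locations and the exact dcmetr.log line layout are assumptions, so adjust them to your installation.

    # Look for overlapping PIDs decoding the same date/time file:
    grep 'sao.gem' ~ldm/logs/dcmetr.log | sort | less

    # Step pqact's logging up to debug (each USR2 bumps the level),
    # check the pq_sequence delays, then cycle back to normal:
    kill -USR2 `pgrep pqact`
    kill -USR2 `pgrep pqact`
    grep -i delay ~ldm/logs/ldmd.log | tail   # values under 1 second are healthy
    kill -USR2 `pgrep pqact`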
I will continue to monitor these issues.

Steve Chiswell
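For anyone pursuing the gen_pqact.csh suggestion in Steve's message: the split ends up as multiple pqact entries in ldmd.conf, along these lines. The feedtype expressions and configuration file names below are made up for illustration; the script generates its own.

    # Two pqact instances, each with its own pattern/action file, so a
    # heavy feed (e.g. NEXRAD) doesn't starve the other decoders:
    exec    "pqact -f NNEXRAD etc/pqact.conf.nexrad"
    exec    "pqact -f ANY-NNEXRAD etc/pqact.conf.other"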