[ldm-users] LDM Weirdness

To: "'LDM Users'" <ldm-users@xxxxxxxxxxxxxxxx>
Subject: [ldm-users] LDM Weirdness
From: "Blair Trosper" <blair.trosper@xxxxxxxxx>
Date: Mon, 11 Jul 2011 02:08:21 -0500
We had an unusual incident on our main server in Florida tonight.
Everything was running smoothly (the server had been up for several
months) when the traffic graphs suddenly showed traffic to LDM halting
abruptly.  I happened to be paying attention for once and pulled the server
up over SSH, where I found a litany of these messages in "ldmd.log".  Times
are given in Central; this happened shortly before this e-mail, around 0600
GMT Monday:

****************************************************************************
Jul 11 01:36:40 fl1 bigbird.tamu.edu[3174] ERROR: Disconnecting due to LDM failure; nullproc_6 failure to bigbird.tamu.edu; RPC: Unable to receive; errno = Connection reset by peer
Jul 11 01:36:40 fl1 feeds.michiganwxsystem.net [3177] ERROR: Disconnecting due to LDM failure; nullproc_6 failure to feeds.michiganwxsystem.net; RPC: Unable to receive; errno = Connection reset by peer
****************************************************************************
Jul 11 01:27:57 fl1 bigbird.tamu.edu[3174] NOTE: nullproc_6 failure to bigbird.tamu.edu; RPC: Timed out
****************************************************************************
Jul 11 01:41:07 fl1 server1.wxalliance.com [20728] ERROR: Disconnecting due to LDM failure; Couldn't connect to LDM on server1.wxalliance.com using either port 388 or portmapper; : RPC: Remote system error - Connection timed out
Jul 11 01:41:28 fl1 bigbird.tamu.edu[20727] ERROR: Disconnecting due to LDM failure; nullproc_6 failure to bigbird.tamu.edu; RPC: Unable to receive; errno = Connection reset by peer
****************************************************************************
Jul 11 01:51:22 fl1 rpc.ldmd[3656] NOTE: local_portmapper_running(): clnttcp_create() failure: : RPC: Remote system error - Connection refused
****************************************************************************

Sometimes it would curiously be followed by this:

****************************************************************************
Jul 10 00:03:56 fl1 bigbird.tamu.edu[20723] NOTE: LDM-6 desired product-class: 20110710040356.413 TS_ENDT {{IDS|DDPLUS,  ".*"}
****************************************************************************

...but then it would go right back to filling my logs with RPC errors and
warnings.  To clarify, the server was not having issues with name resolution
or with physically reaching those different networks.  However, even though
Internet connectivity did not appear to be affected, both "ldmping" and
"notifyme" to all of our providers would fail with similar RPC errors and
complaints.
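
One way to separate plain reachability from an LDM-level failure is to try
a bare TCP connect to the LDM port on each upstream.  The following is a
minimal Python sketch, not something that was run during the incident: the
host list is copied from the log excerpts above, port 388 is the one the
log names, and `tcp_check` is a hypothetical helper name.  A successful
connect only shows that the port accepts connections; it says nothing about
whether the remote LDM will answer RPC calls.

```python
import socket

# Hostnames copied from the log excerpts; adjust to your own upstream list.
UPSTREAMS = [
    "bigbird.tamu.edu",
    "feeds.michiganwxsystem.net",
    "server1.wxalliance.com",
]
LDM_PORT = 388  # port named in the "Couldn't connect" log message


def tcp_check(host, port=LDM_PORT, timeout=5.0):
    """Try a plain TCP connect and return (ok, detail).

    This does not speak the LDM protocol; it only reports whether the
    TCP handshake succeeded, was refused, or timed out.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, "connected"
    except OSError as exc:  # covers refused, reset, timeout, DNS failure
        return False, f"{type(exc).__name__}: {exc}"
```

Calling `tcp_check(host)` for each entry in `UPSTREAMS` while "ldmping" is
failing would show whether connections are being refused, reset, or timing
out at the TCP level.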

I began quoting Hermes ("we're jerked!") and poking at everything I could
find, only to have the problem spontaneously resolve about 45 minutes later.
Harrumph!

While I'm glad it's working, I'm curious to know if anyone has any ideas why
our server had a near total breakdown for the better part of 45 minutes.
I'm not sure if my eyes and brain just aren't catching on to the error given
the late hour...or if I am totally missing something obvious.  I like
problems that resolve themselves, but I always like to know the "why" for
next time.
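
For the "why", a per-upstream tally of the log messages can help show
whether every feed failed at once or one went first.  Below is a rough
Python sketch under stated assumptions: the line layout in `LINE_RE` is
inferred only from the excerpts above (timestamp, local host, process name,
pid in brackets, level), each entry is assumed to sit on a single line as
it would in the log file itself, and `tally` is a hypothetical helper name.

```python
import re
from collections import Counter

# Assumed shape: "Mon DD HH:MM:SS host process[pid] LEVEL: message".
# Some excerpts have a space before "[pid]", so that space is optional.
LINE_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+"
    r"(?P<host>\S+)\s+(?P<proc>\S+?)\s*\[(?P<pid>\d+)\]\s+"
    r"(?P<level>[A-Z]+):\s*(?P<msg>.*)$"
)


def tally(lines):
    """Count log entries per (process name, level); skip unparsable lines."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            counts[(match.group("proc"), match.group("level"))] += 1
    return counts
```

Feeding it the lines of "ldmd.log" (for example, `tally(open(path))` with
`path` pointing at the log) gives counts keyed by upstream and severity;
lines that do not match the assumed layout are ignored rather than guessed
at.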

Thanks!

Blair @ Weather Data/Austin TX
Follow-Ups:
- Re: [ldm-users] LDM Weirdness
  - From: Gilbert Sebenste