We had an unusual incident on our main server in Florida tonight. Everything was running smoothly (and the server had been up for several months), when suddenly the traffic graphs showed the traffic to LDM halting abruptly. I happened to be paying attention for once and pulled the server up on SSH, to find a litany of these messages in "ldmd.log"...(times given in Central...this happened shortly before this e-mail, around 0600 GMT Monday):

****************************************************************************
Jul 11 01:36:40 fl1 bigbird.tamu.edu[3174] ERROR: Disconnecting due to LDM failure; nullproc_6 failure to bigbird.tamu.edu; RPC: Unable to receive; errno = Connection reset by peer
Jul 11 01:36:40 fl1 feeds.michiganwxsystem.net [3177] ERROR: Disconnecting due to LDM failure; nullproc_6 failure to feeds.michiganwxsystem.net; RPC: Unable to receive; errno = Connection reset by peer
****************************************************************************
Jul 11 01:27:57 fl1 bigbird.tamu.edu[3174] NOTE: nullproc_6 failure to bigbird.tamu.edu; RPC: Timed out
****************************************************************************
Jul 11 01:41:07 fl1 server1.wxalliance.com [20728] ERROR: Disconnecting due to LDM failure; Couldn't connect to LDM on server1.wxalliance.com using either port 388 or portmapper; : RPC: Remote system error - Connection timed out
Jul 11 01:41:28 fl1 bigbird.tamu.edu[20727] ERROR: Disconnecting due to LDM failure; nullproc_6 failure to bigbird.tamu.edu; RPC: Unable to receive; errno = Connection reset by peer
****************************************************************************
Jul 11 01:51:22 fl1 rpc.ldmd[3656] NOTE: local_portmapper_running(): clnttcp_create() failure: : RPC: Remote system error - Connection refused
****************************************************************************

Sometimes it would curiously be followed by this:

****************************************************************************
Jul 10 00:03:56 fl1 bigbird.tamu.edu[20723] NOTE: LDM-6 desired product-class: 20110710040356.413 TS_ENDT {{IDS|DDPLUS, ".*"}
****************************************************************************

...but then go right back to filling my logs with RPC errors and warnings.

To clarify, the server was not having issues with name resolution or physically reaching those different networks. However, even though the Internet connectivity did not appear to be affected, both "ldmping" and "notifyme" to all of our providers would fail with similar RPC errors and complaints. I began quoting Hermes ("we're jerked!") and poking at everything I could find, only to have the problem spontaneously resolve about 45 minutes later. Harumph!

While I'm glad it's working, I'm curious to know if anyone has any ideas why our server had a near total breakdown for the better part of 45 minutes. I'm not sure if my eyes and brain just aren't catching on to the error given the late hour...or if I am totally missing something obvious. I like problems that resolve themselves, but I always like to know the "why" for next time.

Thanks!

Blair @ Weather Data/Austin TX
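
For anyone sifting through a similar flood of messages, a rough sketch like the one below may help show whether the failures hit every upstream host at once or only some of them. It counts the trailing "RPC: ..." reason per host in lines shaped like the excerpts above. The line layout is inferred only from those excerpts, and the names LINE_RE, rpc_reason and tally are made up for this illustration; none of this is part of LDM itself.

```python
import re
from collections import Counter

# Assumed line shape, inferred only from the excerpts quoted in this message:
#   "Jul 11 01:36:40 fl1 bigbird.tamu.edu[3174] ERROR: message text"
# Some lines have a space before the "[pid]" part, so that space is optional.
LINE_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s+"   # e.g. "Jul 11 01:36:40"
    r"(?P<local>\S+)\s+"                             # local host, e.g. "fl1"
    r"(?P<ident>\S+?)\s*\[(?P<pid>\d+)\]\s+"         # remote host or process name, pid
    r"(?P<level>[A-Z]+):\s+(?P<msg>.*)$"             # "ERROR" / "NOTE", then the message
)


def rpc_reason(msg):
    """Return the text after the last 'RPC: ' in msg, or None if there is none."""
    idx = msg.rfind("RPC: ")
    if idx == -1:
        return None
    return msg[idx + len("RPC: "):].strip()


def tally(lines):
    """Count (ident, RPC reason) pairs; lines without an RPC reason are skipped."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m is None:
            continue  # not in the assumed layout
        reason = rpc_reason(m.group("msg"))
        if reason is None:
            continue  # e.g. the "desired product-class" NOTE lines
        counts[(m.group("ident"), reason)] += 1
    return counts


if __name__ == "__main__":
    # Two lines copied from the excerpts above, used as a tiny demo input.
    demo = [
        "Jul 11 01:27:57 fl1 bigbird.tamu.edu[3174] NOTE: nullproc_6 failure "
        "to bigbird.tamu.edu; RPC: Timed out",
        "Jul 11 01:51:22 fl1 rpc.ldmd[3656] NOTE: local_portmapper_running(): "
        "clnttcp_create() failure: : RPC: Remote system error - Connection refused",
    ]
    for (ident, reason), n in tally(demo).most_common():
        print(n, ident, reason)
```

To run it against a real log, pass the open file to tally(), for example tally(open(path)) with path pointing at your own ldmd.log. If every upstream shows the same reason over the same stretch of time, that would point toward something local to the server rather than any single provider, though a tally alone cannot say what.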