Manuel,

> We execute the following lines to check if LDM is running...
>
> kill -s 0 $(cat ldmd.pid) && exit 0
>
> ps -elf | mail -s "TIGGE: ldm is not running on $HOSTNAME" address@hidden address@hidden
>
> killall -9 rpc.ldmd pqact ldmping rtstats send || true
> ldmadmin clean
> pqcat -l- -s -q /usr/local/ldm/data/ldm.pq && pqcheck -F -q /usr/local/ldm/data/ldm.pq
> ldmadmin start
>
> LDM has just crashed again. This is the output of the script that
> contains the above lines, which has just executed:
>
> ++ cat ldmd.pid
> + kill -s 0 14972
> -bash: line 6: kill: (14972) - No such process
> + ps -elf
> + mail -s 'TIGGE: ldm is not running on tigge-ldm' address@hidden address@hidden
> + killall -9 rpc.ldmd pqact ldmping rtstats send
> ldmping: no process killed
> + true
...
> 0 D ldm 610 16806 0 76 0 - 1004497 sync_p 17:10 ? 00:00:00 pqinsert -v -l /usr/local/ldm/logs/ldmd.log -p z_tigge_c_rjtd_20061202120000_glob_prod_pf_pl_0060_014_0300_v.grib:14747
...
1 S ldm 14982 1 1 76 0 - 1005582 - 15:51 ? 00:01:34 rpc.ldmd -P 388 -v -q /usr/local/ldm/data/ldm.pq /usr/local/ldm/etc/ldmd.conf
1 S ldm 14983 1 3 75 0 - 1005582 - 15:51 ? 00:02:34 rpc.ldmd -P 388 -v -q /usr/local/ldm/data/ldm.pq /usr/local/ldm/etc/ldmd.conf
1 S ldm 14984 1 3 76 0 - 1005582 - 15:51 ? 00:02:36 rpc.ldmd -P 388 -v -q /usr/local/ldm/data/ldm.pq /usr/local/ldm/etc/ldmd.conf
1 S ldm 14985 1 0 75 0 - 1005649 - 15:51 ? 00:00:14 rpc.ldmd -P 388 -v -q /usr/local/ldm/data/ldm.pq /usr/local/ldm/etc/ldmd.conf
1 S ldm 14986 1 5 75 0 - 1005581 - 15:51 ? 00:04:40 rpc.ldmd -P 388 -v -q /usr/local/ldm/data/ldm.pq /usr/local/ldm/etc/ldmd.conf
1 S ldm 14987 1 0 75 0 - 1005649 - 15:51 ? 00:00:13 rpc.ldmd -P 388 -v -q /usr/local/ldm/data/ldm.pq /usr/local/ldm/etc/ldmd.conf
1 S ldm 14988 1 5 76 0 - 1005581 - 15:51 ? 00:04:24 rpc.ldmd -P 388 -v -q /usr/local/ldm/data/ldm.pq /usr/local/ldm/etc/ldmd.conf
1 S ldm 14989 1 0 75 0 - 1005649 - 15:51 ? 00:00:14 rpc.ldmd -P 388 -v -q /usr/local/ldm/data/ldm.pq /usr/local/ldm/etc/ldmd.conf
1 S ldm 14990 1 5 75 0 - 1005582 - 15:51 ? 00:04:26 rpc.ldmd -P 388 -v -q /usr/local/ldm/data/ldm.pq /usr/local/ldm/etc/ldmd.conf
1 S ldm 14991 1 0 75 0 - 1005649 - 15:51 ? 00:00:15 rpc.ldmd -P 388 -v -q /usr/local/ldm/data/ldm.pq /usr/local/ldm/etc/ldmd.conf
1 S ldm 14992 1 5 76 0 - 1005582 - 15:51 ? 00:04:23 rpc.ldmd -P 388 -v -q /usr/local/ldm/data/ldm.pq /usr/local/ldm/etc/ldmd.conf
1 S ldm 14993 1 0 75 0 - 1005648 - 15:51 ? 00:00:15 rpc.ldmd -P 388 -v -q /usr/local/ldm/data/ldm.pq /usr/local/ldm/etc/ldmd.conf
1 S ldm 14994 1 5 75 0 - 1005582 - 15:51 ? 00:04:30 rpc.ldmd -P 388 -v -q /usr/local/ldm/data/ldm.pq /usr/local/ldm/etc/ldmd.conf
1 S ldm 14995 1 0 75 0 - 1005648 - 15:51 ? 00:00:13 rpc.ldmd -P 388 -v -q /usr/local/ldm/data/ldm.pq /usr/local/ldm/etc/ldmd.conf
1 S ldm 14996 1 5 75 0 - 1005582 - 15:51 ? 00:04:24 rpc.ldmd -P 388 -v -q /usr/local/ldm/data/ldm.pq /usr/local/ldm/etc/ldmd.conf
1 S ldm 14997 1 0 75 0 - 1005648 - 15:51 ? 00:00:14 rpc.ldmd -P 388 -v -q /usr/local/ldm/data/ldm.pq /usr/local/ldm/etc/ldmd.conf
...

It appears that the "kill -s 0" to the top-level LDM server is returning with an unsuccessful status, indicating that the LDM server isn't running, when that is not the case. The subsequent "killall -9" abruptly terminates all LDM processes -- including those that have the product-queue open for writing (e.g., the pqinsert(1) process and, probably, some of the "rpc.ldmd" processes). This is why the product-queue is getting corrupted.

I do not know why the "kill -s 0" indicates that the LDM server isn't running. You might check your documentation on that command. It's also possible that an "ldmadmin stop" was executed just before the script but that not all the processes had terminated.

In order to fix this, the "kill -s 0" command needs to be fixed, or it should be executed multiple times in order to be certain, or the "killall" command should be replaced with an "ldmadmin stop" so that processes that have the product-queue open for writing have a chance to close it. If the pqinsert(1) process was started by an EXEC entry in the LDM configuration-file (ldmd.conf), then it will receive a SIGINT and terminate gracefully.

In any case, a "killall -9" probably should not be executed. Try an "ldmadmin stop" instead.

Regards,
Steve Emmerson

Ticket Details
===================
Ticket ID: EVX-684652
Department: Support IDD TIGGE
Priority: Normal
Status: Closed
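
The two suggestions in the reply (probe more than once before concluding that the server is down, and stop through ldmadmin rather than with "killall -9") could be sketched in shell roughly as follows. This is a minimal sketch and not something shipped with LDM: the function names ldm_is_running and restart_ldm, the retry count, and the delay are assumptions chosen for illustration. Only the "kill -s 0" probe and the "ldmadmin stop", "ldmadmin clean", and "ldmadmin start" commands come from the messages above.

```shell
# Sketch only: the function names and the retry defaults are illustrative
# and are not part of the LDM package.

# Exit status 0 if the PID recorded in the given file names a live process.
# The probe is repeated so that a single failed check is not taken as proof
# that the server is down. Note that "kill -s 0" also fails when the process
# exists but belongs to another user.
# The ( ... ) function body runs in a subshell, which keeps the variables local.
ldm_is_running() (
    pidfile=$1
    attempts=${2:-3}    # assumed default: three probes
    delay=${3:-5}       # assumed default: five seconds between probes
    i=1
    while [ "$i" -le "$attempts" ]; do
        if [ -r "$pidfile" ]; then
            pid=$(cat "$pidfile")
            case $pid in
                '' | *[!0-9]*) ;;    # empty or non-numeric: treat as a failed probe
                *) kill -s 0 "$pid" 2>/dev/null && return 0 ;;
            esac
        fi
        [ "$i" -lt "$attempts" ] && sleep "$delay"
        i=$((i + 1))
    done
    return 1
)

# Restart through ldmadmin so that processes writing to the product-queue
# get a chance to close it; there is deliberately no "killall -9" here.
restart_ldm() {
    ldmadmin stop
    ldmadmin clean && ldmadmin start
}
```

A monitoring script might then run something like `ldm_is_running ldmd.pid 3 5 || restart_ldm`, where the pid-file name is the one used in the original script and the other two arguments are the assumed probe count and delay in seconds.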