


Re: 20011204: LDM: pqbinstats & system crash



Unidata Support wrote:
> 
> ------- Forwarded Message
> 
> >To: <address@hidden>
> >From: Tom McDermott <address@hidden>
> >Subject: LDM: pqbinstats & system crash
> >Organization: UCAR/Unidata
> >Keywords: 200111301311.fAUDBmN10047
> 
> Hi,
> 
> I don't think this is the sort of problem that lends itself to a solution,
> but I thought I would report it anyway.  My server is a Sun SPARCstation 10
> running Solaris 7 and LDM 5.1.4.  At 3:34 AM EST today something happened
> which seems to have been triggered by the pqbinstats program.  From the
> system log:
> 
> Nov 30 03:34:43 vortex unix: BAD TRAP: type=2 rp=fc0997c4 addr=0 mmu_fsr=0 rw=0
> Nov 30 03:34:44 vortex unix: pqbinstats:
> Nov 30 03:34:44 vortex unix: Illegal instruction
> Nov 30 03:34:44 vortex unix: pid=20212, pc=0xf00647fc, sp=0xfc099810, psr=0x408010c5, context=144
> Nov 30 03:34:44 vortex unix: g1-g7: 78727300, 0, f8ba76d8, 640, fc099b80, 1, f733e9a0
> Nov 30 03:34:44 vortex unix: Begin traceback... sp = fc099810
> Nov 30 03:34:44 vortex unix: Called from f008fda4, fp=fc099878, args=f8ba76d8 fc099a38 fc099b80 fc099ee0 fc099b80 0
> Nov 30 03:34:44 vortex unix: Called from f0090148, fp=fc0998d8, args=fc0999c0 fc099a38 f8ba76d8 0 0 4000000
> Nov 30 03:34:44 vortex unix: Called from f0066e94, fp=fc099b80, args=0 efffec70 0 0 0 1f22c
> Nov 30 03:34:44 vortex unix: Called from 13444, fp=efffef10, args=f 38520 198 0 3c06afe0 66
> Nov 30 03:34:44 vortex unix: End traceback...
> Nov 30 03:34:46 vortex unix: panic:
> Nov 30 03:34:46 vortex unix: Illegal instruction
> Nov 30 03:34:46 vortex unix:
> Nov 30 03:34:46 vortex unix: syncing file systems...
> Nov 30 03:34:46 vortex unix:  18
> Nov 30 03:34:46 vortex unix:  5
> Nov 30 03:34:46 vortex unix:  4
> Nov 30 03:34:46 vortex last message repeated 19 times
> Nov 30 03:34:46 vortex unix:  cannot sync -- giving up
> 
> This by itself wouldn't have been too bad, but as the last message might
> lead you to suspect, when the system rebooted, the product queue was
> corrupt.  But instead of the ldm system stopping, the rpc.ldmd server and
> pqact processes continued to run and more server processes were spawned as
> downstream sites kept trying to connect.  This led to a situation where
> the rpc.ldmd processes almost completely chewed up the CPU:
> 
> last pid:  7035;  load averages: 94.12, 92.54, 87.27              07:06:25
> 188 processes: 90 sleeping, 92 running, 3 zombie, 3 on cpu
> CPU states:  0.0% idle, 95.7% user,  4.3% kernel,  0.0% iowait,  0.0% swap
> Memory: 512M real, 338M free, 107M swap in use, 1065M swap free
> 
>   PID USERNAME THR PRI NICE  SIZE   RES STATE    TIME    CPU COMMAND
>   550 ldm        1  58    0  301M 2696K run     15:55  1.75% pqsurf
>  5735 ldm        1  59    0  293M 2692K run      1:44  1.64% rpc.ldmd
>  5436 ldm        1  59    0  293M 2672K run      2:37  1.44% rpc.ldmd
>  5076 ldm        1  49    0  293M 2672K run      2:33  1.36% rpc.ldmd
>   552 ldm        1  49    0  293M 2328K run     15:42  1.34% rpc.ldmd
>   549 ldm        1  49    0  293M 3280K run     16:11  1.31% pqact
>  4989 ldm        1  59    0  293M 2672K run      2:55  1.27% rpc.ldmd
>  1780 ldm        1  59    0  293M 2684K run      6:56  1.23% rpc.ldmd
>  1419 ldm        1  58    0  293M 2672K run      8:50  1.22% rpc.ldmd
>  4487 ldm        1  59    0  293M 2672K run      3:46  1.17% rpc.ldmd
>  6188 ldm        1  59    0  293M 2684K run      0:55  1.16% rpc.ldmd
>  2542 ldm        1  59    0  293M 2680K run      5:50  1.14% rpc.ldmd
>  1049 ldm        1  49    0  293M 2692K run     11:06  1.13% rpc.ldmd
>  4802 ldm        1  59    0  293M 2680K run      3:24  1.12% rpc.ldmd
>  5892 ldm        1  54    0  293M 2672K run      1:09  1.11% rpc.ldmd
>  6827 ldm        1  49    0  293M 2676K run      0:07  1.11% rpc.ldmd
>  5159 ldm        1  49    0  293M 2672K run      2:22  1.10% rpc.ldmd
>  6420 ldm        1  59    0  293M 2680K run      0:38  1.10% rpc.ldmd
> 
> But after manually killing the rpc.ldmd processes (ldmadmin stop didn't
> work), I remade the queues and all is now well again.
> 
> Tom
> -----------------------------------------------------------------------------
> Tom McDermott                           Email: address@hidden
> Systems Administrator                   Phone: (716) 395-5718
> Earth Sciences Dept.                    Fax: (716) 395-2416
> SUNY College at Brockport
> 
> ------- End of Forwarded Message

Hi there, Tom,

In two years I have not heard of pqbinstats crashing.  If you have a
core file, we can see where it crashed and what it was doing, which may
or may not lead us to a conclusion about why it happened.
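
On Solaris 7 the stock adb debugger can pull a backtrace out of a core
file.  A minimal sketch of the steps follows; it only prints the
commands rather than running them, so it is safe anywhere, and the LDM
binary path is an assumption to adjust for your installation:

```shell
#!/bin/sh
# Sketch only: print (not execute) the commands for inspecting a
# pqbinstats core on Solaris 7.  adb ships with Solaris; dbx or gdb,
# if installed, give friendlier output.  The bin path is an assumption.
inspect_core_plan() {
    echo 'file core'                               # which binary dumped it?
    echo 'adb /usr/local/ldm/bin/pqbinstats core'  # load binary plus core
    echo '$c'                                      # adb command: backtrace
}
plan=$(inspect_core_plan)
```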

One possibility is a bad disk block.  It could be that the reboot
detected and repaired it; would that appear in your logs?  If this
happens again, you could run fsck to scan for bad blocks.  If there are
bad blocks underneath the ldm installation, a reinstallation would be
prudent.
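
Disk trouble of that sort usually leaves earlier traces in
/var/adm/messages.  A self-contained sketch of the kind of scan meant
here (the grep pattern and the sample log lines are illustrative
assumptions, not taken from Tom's system):

```shell
#!/bin/sh
# Scan a Solaris message log for disk-error lines that would support
# the bad-block theory.  /var/adm/messages is the usual target; a small
# sample file stands in for it here so the sketch runs anywhere.
scan_disk_errors() {
    # Count lines matching phrases Solaris disk drivers typically log.
    grep -icE 'error|retryable|bad block' "$1"
}

cat > sample-messages <<'EOF'
Nov 29 22:10:01 vortex unix: WARNING: /sbus@1f,0/espdma@e,8400000 (sd3):
Nov 29 22:10:01 vortex unix: Error for Command: read  Error Level: Retryable
Nov 30 03:34:43 vortex unix: BAD TRAP: type=2 rp=fc0997c4 addr=0
EOF
hits=$(scan_disk_errors sample-messages)   # number of matching lines
```

On a real system one would point scan_disk_errors at /var/adm/messages
and look at the matching lines themselves, not just the count.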

I believe pqbinstats reads the queue, so that might explain the queue
corruption.  It is not uncommon to see the runaway rpc.ldmd processes
once the ldm gets in such a confused state.  At that point, killing them
by hand like you did may be the only option.
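
For anyone who hits the same state later, the by-hand recovery can be
scripted along these lines.  The sketch prints its plan instead of
executing it (drop the echoes to run it for real); the ldmadmin
subcommands are the standard ones, but check them against your release:

```shell
#!/bin/sh
# Sketch of manual LDM recovery when "ldmadmin stop" does not work:
# kill the servers by name, discard the corrupt queue, recreate, restart.
# Printed rather than executed so the sketch is safe to run anywhere.
recovery_plan() {
    echo 'pkill -u ldm rpc.ldmd'   # the runaway server processes
    echo 'pkill -u ldm pqact'      # and the queue readers
    echo 'pkill -u ldm pqsurf'
    echo 'ldmadmin delqueue'       # discard the corrupt product queue
    echo 'ldmadmin mkqueue'        # recreate it empty
    echo 'ldmadmin start'          # bring the LDM back up
}
plan=$(recovery_plan)
```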

Anne
-- 
***************************************************
Anne Wilson                     UCAR Unidata Program            
address@hidden                 P.O. Box 3000
                                  Boulder, CO  80307
----------------------------------------------------
Unidata WWW server       http://www.unidata.ucar.edu/
****************************************************