Hi Tom,

>From: Tom McDermott <address@hidden>
>Subject: LDM: out of per user processes
>Organization: SUNY Brockport
>Keywords: 200012141622.eBEGM4o06206 LDM processes

In the above message, you wrote:

> This morning when I came in (delayed several hours because of a
> snowstorm), no users were able to access our server.  The reason for
> this was clear from these messages in the system log:
>
> Dec 14 06:31:03 vortex unix: NOTICE: out of per-user processes for uid 214
> Dec 14 06:32:25 vortex last message repeated 23 times
 ...
> Now uid 214 is the ldm, so it is the likely culprit.  This happened
> once before several months ago.  At that time I recompiled ldm with
> just the '-O' option, since I suspected that a target option I
> originally used in compiling ldm might have been the cause.  But it
> appears I was wrong about that.
>
> Info: SparcStation 10/712MP, 512MB, Solaris 5.7, ldm 5.1.2 .
>
> I couldn't find anything on this in the ldm-support archives after a
> quick search, so I thought I'd ask if you had any ideas.  I suppose
> it could be a lot of things, pqact spawns tons of processes.

I've been unable to identify any definite LDM-related cause for this,
but I can offer a couple of theories that you can evaluate.

The only time we have seen anything like this here was in August, when
an LDM host had its load average climb to 2000 (!).  We determined
that this was caused by a different LDM host, running on an HP-UX 11.0
system, hammering it with FEEDME requests.  We have never successfully
gotten around the RPC library problems on the HP-UX 11.0 platform, so
we distribute HP-UX 10.20 binaries for it and recommend that people
build the LDM using the HP-UX 10 compatibility mode on HP-UX 11
platforms.

So one possibility is that some downstream site built the LDM for
HP-UX 11 and then requested data from your site many times per second,
causing an LDM sender process to be launched for each such request.
The only sites we see feeding from your vortex host are
blizzard.weather.brockport.edu and catwoman.cs.moravian.edu, but we
don't have a record of whether either of these is an HP-UX platform.
Do you happen to know?  We've just gotten a new HP-UX 11 platform in,
so we hope to be able to fix or find a workaround for this problem in
the near future.

Another possible cause Anne had seen was upgrading to 5.1.2 without
remaking the queue, but I have been unable to duplicate this problem
here and can't understand how that could cause any additional
processes to be spawned.  When I tried it here, the LDM just reported
the problem and exited, as it is supposed to do:

  Dec 14 20:09:06 rpc.ldmd[25256]: ldm.pq: Not a product queue
  Dec 14 20:09:06 rpc.ldmd[25256]: pq_open failed: ldm.pq: Invalid argument
  Dec 14 20:09:06 rpc.ldmd[25256]: Exiting
  Dec 14 20:09:06 rpc.ldmd[25256]: Terminating process group

A final possibility is that the problem is caused by some decoder or
other program or script launched by pqact.  It is relatively easy to
write a recursive shell script that quickly consumes the process table
if there are no per-user limits set for a user who tries to debug and
run such a script (I've done it!).  If you have other users on the
machine, one of their programs could have spawned processes
recursively or in a loop and used up all the process table entries, so
when the LDM tried to spawn a decoder process, it hit the limit and
produced the message.

Here are a couple of suggestions that might help diagnose the problem.
First, take ps snapshots (or use top) to see all the ldm processes
running, and try to account for each one from log file entries, to
make sure there aren't any extra processes being created.  The "pgrep"
command on Solaris 2.7 and later is useful for this.  For example,

  pgrep -fl -u ldm

shows all processes owned by user "ldm".  Piping this into "wc -l"
gives you a quick count of ldm processes and lets you monitor whether
the number of ldm processes is climbing slowly.
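In case it is useful, here is a minimal sketch of that kind of count
wrapped in a few shell functions, so the number can be logged over
time.  The names LDM_USER, LOG_FILE, count_lines, count_user_procs and
log_proc_count are made up for this example and are not part of the
LDM; the log path is also just an assumption.

```shell
#!/bin/sh
# Sketch only. LDM_USER, LOG_FILE and the function names below are
# assumptions made for this example; none of them come from the LDM.
LDM_USER="${LDM_USER:-ldm}"
LOG_FILE="${LOG_FILE:-/tmp/ldm-proc-count.log}"

# Count the lines on standard input, stripping the padding wc may add.
count_lines() {
    wc -l | tr -d ' \t'
}

# Print the number of processes owned by the given user.
# pgrep prints one line per matching process, so lines == processes.
count_user_procs() {
    pgrep -u "$1" | count_lines
}

# Append "<date> <time> <user> <count>" to LOG_FILE, so that a slow
# climb in the process count shows up when the log is read later.
log_proc_count() {
    printf '%s %s %s\n' \
        "$(date '+%Y-%m-%d %H:%M:%S')" "$1" "$(count_user_procs "$1")" \
        >> "$LOG_FILE"
}
```

After sourcing the file, running log_proc_count "$LDM_USER" from time
to time would append one line per call, and the counts in successive
lines can then be compared.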
But a periodic count would be of no help if something triggers
spawning a bunch of processes quickly.  If that happens, it would help
to have ps or pgrep output, but to catch it you might have to run a
cron job that dumps the output of pgrep to a file every minute
(overwriting the previous file or ping-ponging between two files), so
that if your process table filled, you would have a record of what
things looked like within the previous minute.

The only other suggestion I can offer is to set a limit on the number
of processes that can be spawned by the ldm user, to make sure it
doesn't use up processes needed by other users or cause the system to
run out of process slots.  I had thought you could use the "ulimit"
command to set this limit, but having read the ulimit man page, I
don't see how to do it.  I'll send email to our sysadmins, in case one
of them knows, and pass on any useful answer I get.

I'm still very interested in resolving whether this is a symptom of an
LDM bug, so if you find out anything else, please let me know.
Thanks!

--Russ

_____________________________________________________________________

Russ Rew                                         UCAR Unidata Program
address@hidden                            http://www.unidata.ucar.edu