Okay, the whole story goes like this... We had an old Sun system that ran these scripts. When I started taking care of it in September, it was obvious that this system was too old and decrepit to continue. Luckily, I was able to convince the powers that be that a new system was needed. The new system came in December, and we transferred all scripts and updated GEMPAK and LDM to the new versions. The system worked fine from about Christmas until maybe a month ago. Around this time I added a couple of scripts (model differences). When the crashes started occurring, my immediate thought was an error with one of the new scripts. Therefore, I disabled all of them, to no avail. I then looked and found a java script that had stopped working and thought that was the problem. I disabled it and found no change in performance. It appears to me that the crashes occur randomly, at different times of the day and after very different uptimes. Sometimes we're up for a week, sometimes (like yesterday) we get 3 crashes in a single day. Therefore, I conclude that if it is a single script, it's one that runs at least hourly. I've made a list of all of these scripts and am now disabling them one by one to see if I get any results.
Gabe,

If we saw problems like that here, we would immediately suspect the hardware. (I'm a little late to this discussion, so pardon me if it's been talked about already.)
Now before I continue: I have to admit that this could very well be a software/operating system problem, BUT it could also be hardware-related.
Three things immediately come to mind: 1) Cooling problem. 2) Bad power supply. 3) Bad (or failed) memory.
For 1), check all fans and all cooling fins on all heat sinks. Clean and/or replace as necessary. Make sure the CPU heatsink(s) is (are) properly seated on top of the CPU chip.
For 2), I don't know of an easy way to test a power supply, so we always swap out suspect power supplies to see if it eliminates the problem.
For 3), run memtest86 if you can afford the downtime. See this website for more information: <http://www.memtest86.com/> (Linux packages are available.)
Some other things:
o Could be a bad CPU. We have had a handful of failed AMD CPUs over the past several years.
o Re-seat all expansion cards and cables.
o If you are using a 3ware RAID controller, you might consider doing a volume verify.
o Gilbert mentioned buggy BIOSs in another post. You might consider checking with the motherboard vendor for BIOS updates.
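One way to check the cooling theory in 1) between physical inspections is to log temperatures regularly and look at the last readings before each crash. What follows is only a rough sketch, assuming a Linux system that exposes its sensors under /sys/class/thermal (not every motherboard does). The function names are made up for illustration and are not part of GEMPAK, LDM, or any vendor tool.

```python
import glob
import os
import time


def read_zone_temps(base_dir="/sys/class/thermal"):
    """Return {zone_name: degrees_C} for every readable thermal zone.

    Assumes the Linux sysfs layout <base_dir>/thermal_zone*/temp.
    """
    temps = {}
    pattern = os.path.join(base_dir, "thermal_zone*", "temp")
    for path in sorted(glob.glob(pattern)):
        zone = os.path.basename(os.path.dirname(path))
        try:
            with open(path) as f:
                # The kernel reports millidegrees Celsius.
                temps[zone] = int(f.read().strip()) / 1000.0
        except (OSError, ValueError):
            continue  # skip zones that cannot be read or parsed
    return temps


def format_log_line(temps, now=None):
    """Build one timestamped line such as '... thermal_zone0=45.0C'."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(now))
    readings = " ".join(
        f"{zone}={deg:.1f}C" for zone, deg in sorted(temps.items())
    )
    return f"{stamp} {readings}"


def log_temps(log_path, base_dir="/sys/class/thermal"):
    """Append the current readings to log_path."""
    with open(log_path, "a") as log:
        log.write(format_log_line(read_zone_temps(base_dir)) + "\n")
        log.flush()
        # Force the line to disk so it is not lost if the box crashes.
        os.fsync(log.fileno())
```

Calling log_temps() once a minute from cron, with a log file of your choosing, would leave a record to compare against the crash times. If the last readings before each crash are well above the usual range, cooling becomes the prime suspect. If they look normal, that points more toward the power supply or memory.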
Hope this helps a bit.

- Bryan
University of North Dakota