Replicating load related crashes in non-production environments,
We’re running a custom application on our intranet and we have found a problem after upgrading it recently where IIS hangs with 100% CPU usage, requiring a reset.
Rather than subject users to the hangs, we’ve rolled back to the previous release while we determine a solution. The first step is to reproduce the problem — but we can’t.
Here’s some background:
Prod has a single virtualized (vmware) web server with two CPUs and 2 GB of RAM. The database server has 4GB, and 2 CPUs as well. It’s also on VMWare, but separate physical hardware.
During normal usage the application runs fine. The w3wp.exe process normally uses betwen 5-20% CPU and around 200MB of RAM. CPU and RAM fluctuate slightly under normal use, but nothing unusual.
However, when we start running into problems, the RAM climbs dramatically and the CPU pegs at 98% (or as much as it can get). The site becomes unresponsive, necessitating a IIS restart. Resetting the app pool does nothing in this situation, a full IIS restart is required.
It does not happen during the night (no usage). It happens more when the site is under load, but it has also happened under non-peak periods.
First step to solving this problem is reproducing it. To simulate the load, we starting using JMeter to simulate usage. Our load script is based on actual usage around the time of the crash. Using JMeter, we can ramp the usage up quite high (2-3 times the load during the crash) but the site behaves fine. CPU is up high, and the site does become sluggish, but memory usage is reasonable and nothing is hanging.
Does anyone have any tips on how to reproduce a problem like this in a non-production environment? We’d really like to reproduce the error, determine a solution, then test again to make sure we’ve resolved it. During the process we’ve found a number of small things that we’ve improved that might solve the problem, but I’d really feel a lot more confident if we could reproduce the problem and test the improved version.
Any tools, techniques or theories much appreciated!
That’s the answer Replicating load related crashes in non-production environments, Hope this helps those looking for an answer. Then we suggest to do a search for the next question and find the answer only on our site.
The answers provided above are only to be used to guide the learning process. The questions above are open-ended questions, meaning that many answers are not fixed as above. I hope this article can be useful, Thank you