On server disasters and recoveries
Feb
25
Written by:
Thursday, February 25, 2010 8:36 PM
Anyone who visits this site reasonably frequently will no doubt have come across some serious downtime in the last 10 days or so. The errors ranged from out-of-disk-space to out-of-memory and just plain old ‘server unavailable’. The casual visitor must have thought that it was running on some dusty old 486 box with a 52k modem hanging out the side.
The truth is, disaster always strikes when you are least are able to handle it. In this case, disaster kept striking while I was at the hospital welcoming the arrival of a new baby into my family. Being somewhat busy with this more important event meant that the outages probably went on for twice as long as they should have, but that’s life in the small end of the business world.
Symptoms of the Outages
It’s well known that sometimes, servers just develop a problem and a quick reboot gets them on their way. This was the first strategy I used, and at first it seemed to work OK. When the server needed to be rebooted several times a day I knew something was wrong, but the big problem was getting a connection to it via RDP so I could actually ascertain what was wrong. Because as soon as you could get logged on and start to look around, the performance would bog down to the point where the server became unresponsive and lock up. Then a restart, then more checking – you get the picture.
Solutions to the Outages
Due to a misdiagnosis by the hosting company that I use – they guessed that it was some sort of memory paging pool issue as a problem inherent with 32 bit windows Virtual Servers – I started building up a completely new server to replace the old one, albeit on a 64 bit platform.
I actually had grave concerns about this, because one thing I learned when I cut my teeth doing midnight support shifts as a junior programmer was that when you were trying to get a system up and running fast, you absolutely don’t want to get fancy and start putting in other changes. Priority number 1 in production system outages is getting things back to the exact same point as they were before the outage. But sometimes we aren’t thinking straight when it’s late, you haven’t had much sleep and you’ve got a baby about 24 hours old screaming in your ear. (did you see where I slipped in an excuse there?) So I agreed to the new server and started the migration of the various backups over to the new server.
This, of course, went about as smoothly as a wooden cart down a cobblestone road, and it took many, many hours longer than it should have to get a new server up and going. In the meantime the old one was up and down and generally being about as unreliable as a server can get.
During the process of copying across the system I had my first clue as to what was wrong – the database for this site had grown by about 400% over it’s normal size. I went straight for the usual culprit, the EventLog table in DotNetNuke, and saw that it had about 400,000 rows in it. As each row has a bit fat slab of Xml in a text field, this is where most of the space went (just 100 rows contains about 360kb of Xml). I had to truncate the table so that I could get a quick backup that was easy to copy across to the new server.
Cause of the Outages
Finally I did get the new server cut over, and this allowed me time to sift through the wreckage of the old server. I found several gigabytes of memory dumps in the temp directory of the server : this correlated well with the ‘out of memory’ exceptions that had popped up. The server clearly kept running out of memory, but why?
The answer is in the oversized EventLog table. There was a search bot under the User Agent of “Mozilla/4.0 (compatible; MSIE 6.0; Windows NT; MS Search 4.0 Robot)” which had been hitting the site very hard. As in multiple requests per second hard. This was causing very odd exceptions like this one:
“DotNetNuke.Services.Exceptions.PageLoadException: Hashtable insert failed. Load factor too high.”
I’ll save you the bother of looking up this exception. It’s basically an internal .NET debug assertion (a type of internal system consistency check) which happens when the Hashtable object built into Microsoft.NET gives up and cries uncle. And the DNN event logger was doing a sterling job, managing at some point to log more than 25 of these exceptions per second. I’ll probably never know how many hits per second this bot was giving me, but it was giving the server a damn good stress testing. And funnily enough, it wasn’t up to the job. No sense in pretending – it doesn’t get the sort of traffic that necessitates being able to perform to that level. If I ever get a post ‘slashdotted’ or ‘digged’ then I’ll probably need to save a static version and redirect it.
So what was this mysterious bot? Well, at first glance it looks like some type of Microsoft search bot. But Microsoft are a legitimate search company, and they don’t having badly behaving bots like this. I checked the server logs and came up with this example hit:
2010-02-17 02:00:31 W3SVC31749 BERTROB 122.201.75.111 GET /products/support_forums/forumid/8/scope/threads/noreply/1/currentpage/2901 - 80 - 83.163.4.2 HTTP/1.0 Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT;+MS+Search+4.0+Robot) - - www.ifinity.com.au 200 0 0 34208 340 1093
A quick reverse-DNS lookup on the IP address gave me some ISP in the Netherlands. I’m guessing Microsoft doesn’t run their search bots through ISPs – they’ve got deep pockets to have their own servers on the internet. My guess is that script-kiddie was running some downloaded software harvesting email addresses or something like that. I sent an abuse report off to the ISP and blocked the IP address at the IIS level ; hopefully won’t be hearing any more from them.
Anatomy of the Crash
This is my guess of the sequence of events;
1) bot starts hitting site at rates higher than it can handle (effectively a Denial of Service attack, but probably unintentional)
2) asp.net starts to buckle as resources get used up. My guess is that as the application pool recycles, the old pool takes too long to be discarded from memory and has threads still running, which conflicts with the threads in the ‘new’ application pool starting for the recycle. (when an application pool recycles, it opens up a new process and closes down the old process. Usually it’s instantaneous. DNN users will know the familiar purple event log entries which signify an APPLICATION_START – this is usually the application pool recycling)
3) DNN logs the exceptions from .NET and starts to fill up the event log in the database
4) The server starts to run out of disk space from the rapidly expanding database, impacting performance
5) The ‘Purge Event Log’ job in DNN tries to run, having been triggered. This fails with a Sql Server timeout, because the table has grown too big to be effectively deleted.
6) The server runs out of memory and the ASP.NET worker process crashes, triggering a memory dump on the server
7) The server either dies from lack of disk space or lack of memory.
After a restart, assuming the bot has stopped for the moment, the loop re-enters at about number 5, as the event log has grown too big for DNN to effectively cope with, given the Sql Server resources available to it.
This is all speculation, I can’t prove anything as I only have broken pieces, fragments of log files and the sequence of events to put together.
Conclusions
I learnt very early on in my computing career that it was more important to figure out the causes of problems and how they can be avoided than run straight to the blame-pointing phase and the hunt for the guilty. There are the conclusions:
1) always trust your instinct, particularly if it’s borne of years of experience. Don’t try and put together a new server out of the ashes of your old one. Always restore everything back to the same versions of all the software where possible. You can upgrade after you have things stable.
2) don’t be too keen to accept someone else’s explanation of your problems. It’s always convenient to have someone else diagnose the problem for you, but it always pays to think their diagnosis through and ‘sense test’ it, to make sure it fits the evidence.
3) always have a current backup : in this case no data was lost. Ironically I just did a post a short time ago detailing my new backup procedures. These saved my skin and nothing at all was lost.
4) always check your server logs if something funny is going on. The reality is that there are hackers, script kiddies and malware writers out there looking for something to hack/scrape/ruin/annoy. These actions always show up in the logs and anomalies should stick out. While numbers are on your side, it pays to think of this as a plausible explanation to any outages or problems. Someone like myself who is a developer is so used to breaking their own systems with bugs and mistakes they immediately start looking in the wrong places. I assume a sysadmin would start looking in the right places straight away, because they assume that the software is going to be reliable. I suppose a hardware person would start re-seating circuit boards or something.
5) get familiar with the TRUNCATE TABLE EventLog syntax for dumping the contents of your DNN event log. You can’t use ‘Delete’ because it will just fill up the Sql Server log files. A truncate table command literally dumps the data ; no re-do, no log files, no backups : gone straight away. It’s the best way to empty the table.
So welcome to the new server, so far it’s been performing nicely and there has been no downtime since cutting across. But there are a number of changes planned for this site in the coming weeks and months, so I’ll be making sure it stays that way.
2 comment(s) so far...
Re: On server disasters and recoveries
Glad to see you got things squared away, and congrats on the addition!
By Chris Hammond on
Friday, February 26, 2010 12:17 AM
|
Re: On server disasters and recoveries
As Chris said - glad to hear that the site is back up and running. Hope everything continues to go well with your new family member!
By Sam Erde on
Thursday, March 04, 2010 1:16 AM
|