The day my server died
Friday, August 21, 2009 5:58 PM
This post is a little out of the usual for my blog, but it’s a tale worth telling, particularly if you’ve got more than one computer and data you just can’t lose.
Just over a month ago I started a new project of moving my office. This involved commissioning and building a new office out of a previously unused corner, and fitting it out exactly as I wanted it to be. A clean sheet design, if you will (within budget and space constraints, of course). This new design would incorporate tidy wiring, a server ‘cupboard’ (really, a mini server room), everything up off the floor (so my Roomba can work efficiently) and, well, would be a calm private place of productivity.
The Move
Things always take longer than planned, especially if the plan is on the back of an envelope. This was bad enough, because I had to set things up temporarily and work with stuff absolutely everywhere for a couple of weeks while the final touches were made. If anyone was waiting on me during this period they probably received an email that said something like ‘please bear with me I’m in the middle of moving’.
Finally, though, everything was in place. My brand-new development machine was up and running with Windows 7 and twin 23” monitors and I started to regain productivity : slowly at first, but I was soon in the swing of things.
The Death
However, there was one thing I hadn’t counted on : a dead server. It happens so innocuously : you make a request to a server, and it is offline. It happened on a Saturday afternoon, so I just ignored it, thinking “I’ll deal with that Monday”.
But it was dead. The culprit was a failed main system drive : on my Primary Domain Controller. The only domain controller, as it turns out – who needed a backup for a small setup? It was also the email server, file server and remote access server. Heat was probably the issue : my newly designed server storage cupboard was running hot : air temperature over 34 degrees (Celsius) – the heat soaked into the server rack like a car on a sunny day. I opened all the cases on the computers to give more ventilation and checked all the other drives : too hot to touch. Another computer spontaneously turned itself off : some type of CPU heat shutdown, I expect.
The Backup
About this time I cheerily turned to my backups, and thought I’d have the whole thing running again in a jiffy : how wrong I was. The problem with Active Directory is that even if you have all the NTDS database backed up, if you can’t recover the system registry, you’re hosed. And when you have no Active Directory, you’ve got no domain. When you’ve got no domain, you’ve got no Exchange server. No Exchange server, no emails, tasks, calendar appointments. Your other computers will work in cached mode, but when they start looking for network resources with no primary domain controller to work with, that’s all going to fail as well.
The Rebuild
I spent a day trying to resurrect what was left of the disk drive onto a newly purchased system disk. This was frustratingly close to working, yet, as these things go, so very far from actually working correctly. After several attempts to restore, I cut my losses and started afresh, formatting the hard drive and building an entirely new domain from scratch.
Of course, all of the setup had been years in the making : new share here, an extension there. I had to put it all together. Fortunately I called upon my brother to help me sort it out : he does it for a living so easily solves problems that would have me melting down a Google node with queries.
The other big problem was that all of my install media was out-of-date in terms of service packs. In some cases I had to download gigabytes’ worth of new installs or service packs to get everything back up to date. This not only took a long time, but I also blew my download limit and got ‘shaped’ as a result. With Windows Update these days, the installs are done silently and you don’t realise how different your systems are from their original install disks.
The Restore
Once the new domain was up and running, it was a case of getting back the Exchange backups and setting up the email / appointments / tasks as they were. The only problem was that all of that was set up for a different domain, and the new Exchange server in the new domain didn’t want to have anything to do with the backups from the old domain/Exchange server. So I had to go another route and get some software to convert the cached Outlook data back into a Personal Folders file (.pst) and re-import it into Exchange. This also loses a lot of the ‘meta’ information like when you replied to emails, what was read and what wasn’t. It does get you 90% of the way back, but it’s not the best way.
But finally everything was back to working as normal, and only a few emails and a few calendar entries and to-do items are gone and lost forever (sorry if you’re waiting on me to turn up for something, or I’ve missed your birthday!)
However, once you’ve destroyed an old domain and created a new one, then there are a lot of little things that drag you down. All of your favourites, profiles, cookies, even your wallpaper are all hooked up to your old domain profile. You start again with all that. Yes, I used the files and settings transfer program that Microsoft provide : this just frustratingly transfers across half your stuff. Yes, you might get the right desktop background, but not your authentication cookies or all of your favourites.
Team Foundation Server : always a flighty mistress at the best of times – it’s very sensitive to things like authentication changes. So it took a complete piece-by-piece pick through of the TFS server, finding all the little services, shares, intranet sites and bits and pieces and updating the user permissions and everything else.
The Results
I wrote this as both a cathartic “get it all out” and the classic cautionary tale : “don’t do this at home”. You see, with my RAID drives, scheduled backups, offsite storage and more, I figured I was pretty safe from data loss. And I was/am. I didn’t lose any vital work, source code, documentation or anything like that. What I did lose was a lot of time, which affects service, projects underway, client perception and most of all : my sanity after watching more setup progress bars than any sane person should. It would have been much more fun re-coding a lost project than messing about trying to get an Active Directory installation back up and running.
Oh, and I installed a larger extractor fan in the ‘server cupboard’ – it keeps the computers at a nice running temperature all the time.
Here’s what I would recommend everyone stop and think about:
- if you lost a disk drive today, would you be able to get back all of the important data on it? Remember disk drives are like pets : in the end they all die and we’re likely to outlive them.
- if you’ve got a domain setup, do you have backup domain controllers? Do you have a backup of your NTDS database? Are your servers replicating properly? An out-of-date domain controller is pretty much the same as no domain controller at all.
- if you lost your email, do you have a backup solution? In my case I could carry on using web-based mail, but it is a very inferior solution to a full-featured email system like Outlook.
- have you checked your backups recently? Are the jobs working? Are you out of disk space? Have you tried to do a test restore and make sure the data is OK?
- do you download and save your service packs or just let them run automatically? It’s better to save (and burn) the bigger service packs, because always relying on the download takes time and bandwidth. One thing you can count on with Microsoft is that eventually a service pack will be larger than the software it is patching.
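The backup questions in the list above can be partly automated. What follows is only a minimal Python sketch, not tied to any particular backup product: the function names, the 26-hour freshness window and the 10% free-space threshold are all assumptions picked for illustration. It checks that the newest file in a backup folder is recent, that the backup volume is not nearly full, and that a test-restored file matches its original by checksum.

```python
import hashlib
import shutil
import time
from pathlib import Path


def newest_backup_age_hours(backup_dir, now=None):
    """Age in hours of the most recently modified file, or None if there are no files."""
    now = time.time() if now is None else now
    mtimes = [p.stat().st_mtime for p in Path(backup_dir).rglob("*") if p.is_file()]
    if not mtimes:
        return None
    return (now - max(mtimes)) / 3600.0


def free_space_fraction(path):
    """Fraction (0.0 to 1.0) of the volume holding `path` that is still free."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def sha256_of(path, chunk_size=1 << 20):
    """SHA-256 hex digest of a file, read in chunks so large backups fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def restore_matches(original, restored):
    """True if a test-restored file is byte-for-byte identical to the original."""
    return sha256_of(original) == sha256_of(restored)


def check_backups(backup_dir, max_age_hours=26.0, min_free_fraction=0.10):
    """Return a list of human-readable problems; an empty list means all checks passed.

    The default thresholds are illustrative assumptions, not recommendations.
    """
    problems = []
    age = newest_backup_age_hours(backup_dir)
    if age is None:
        problems.append("no backup files found")
    elif age > max_age_hours:
        problems.append(f"newest backup is {age:.1f} hours old")
    if free_space_fraction(backup_dir) < min_free_fraction:
        problems.append("backup volume is low on free space")
    return problems
```

Calling `check_backups` on your own backup folder from a scheduled task and emailing yourself any non-empty result would cover the "are the jobs working" and "are you out of disk space" questions. `restore_matches` only helps once you have actually restored a file somewhere to compare, so it does not replace a full test restore of something like a domain controller.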
When computers are your life, and how you make your living, it’s easy to forget how dependent we all are on them. And yet they fail from time to time : it’s part of the job and part of the technology. So spend some time making sure everything is OK.
7 comment(s) so far...
Re: The day my server died
Absolutely no fun BUT it is kind of a cool feeling to recover from such an event.
In general, I try to have a system drive that I back up with ghost from DOS, that way I can recover all of the system crud without issue. It is a hassle to run it once a week or so but it really makes recovery fast.
By bill on
Saturday, August 22, 2009 9:20 AM
|
Re: The day my server died
@bill : that's probably the best idea going. A complete system drive hdd image is the way to go : when it dies, restore from image to new HDD, and off you go. My only problem is that with multiple servers it's hard to get it all working reliably.
By Bruce Chapman on
Saturday, August 22, 2009 9:27 AM
|
Re: The day my server died
Bruce,
I feel your pain, I had some RAM go out on my primary development machine, and managed to get back up and running, but it made me think and re-do my backup strategy.
What I have now is Acronis Backup that backs up every one of my critical machines, at the hard drive level, on a daily basis. I can restore from the backup to new hardware or even a Virtual Image in under 2 hours. I've been testing all of the backups, and they work like a charm, I don't know what I'd do without that setup now. I even have a centralized management view that allows me to see all of the backups and their status.
By Mitchel Sellers on
Tuesday, August 25, 2009 4:53 AM
|
Re: The day my server died
@mitchel : sounds like the sort of setup I probably need to implement, to image the system drives and collect on a central drive.
By Bruce Chapman on
Tuesday, August 25, 2009 8:44 AM
|
Re: The day my server died
Another great option is Windows Home Server... Automated backups of multiple machines, centralised fault tolerant storage and remote access all from.
Not only that, it can run on low-power machines such as those with Atom or Celeron CPUs. When you run low on disk space, you can add more disks to upsize your storage. You can even decommission one to replace it with a larger drive, a smart move as the price for HDDs drops...
PS. When backing up AD domain controllers, also back up the System State and make sure you have recorded your Directory Services Restore Mode password...
By Brett Chapman on
Friday, November 06, 2009 12:34 AM
|
Re: The day my server died
Bruce,
Thank you for the story and the lesson.
Cheers,
S.F.
By Sébastien Fichot on
Tuesday, February 16, 2010 11:49 PM
|
Re: The day my server died
you should use raid 1! would have helped with your problem, also consider getting a server with proper redundant systems, the one proliant i have has redundant cpu, ram, rom and power supplies, plus 4 hdd's running in raid 1 via a hardware raid card, it would take a small bomb to make this thing go down lol
By austin on
Sunday, July 10, 2011 1:37 PM
|