Sunday
Jul 27, 2008

Reverse Engineering the Windows Home Server Backup Database

I think it was when ColinWH replied to my request for help that I decided to attempt a crude reverse engineering of the Windows Home Server backup database. I didn't want to know how the whole thing worked, just enough to get that darned Control.4096.dat reconstructed.

Unfortunately, Control.4096.dat was not an XML file at all. It was not a very big file, only 4KB, and it did contain something in the middle that looked like XML. I'll save you the trouble of reading through the excruciating description of how I deciphered the binary format that the Home Server uses. Surprisingly, it didn't take long. I had a good part of the basic infrastructure figured out the day after I posted the initial request for help. It was somewhat crude and incomplete, but it gave me enough information to begin thinking about reconstructing the missing file.
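For anyone wanting to poke at a mostly binary file with a text-like region in the middle, a first pass might look something like the sketch below. It is not the method I used, just a minimal illustration in Python: it scans a byte string for runs of printable ASCII and reports where they start. The function name `printable_runs`, the `min_len` threshold and the sample bytes are all made up for the example, and it assumes the embedded text is plain ASCII (text stored as UTF-16 would not show up this way).

```python
import re


def printable_runs(data: bytes, min_len: int = 8):
    """Return (offset, text) pairs for runs of printable ASCII.

    Only runs of at least min_len bytes are reported, which filters out
    most accidental matches inside binary data.
    """
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [(m.start(), m.group().decode("ascii")) for m in pattern.finditer(data)]


# Made-up sample: a few binary bytes, a text-like fragment, then padding.
# This is NOT the real content of Control.4096.dat.
sample = b"\x00\x10\x00\x00" + b'<example attr="value"/>' + b"\x00" * 4

for offset, text in printable_runs(sample):
    print(f"offset {offset}: {text}")
# prints: offset 4: <example attr="value"/>
```

On a real file you would pass in `open(path, "rb").read()` instead of the sample bytes. The offsets are the useful part, since they show where the binary header ends and the text begins.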

In fact, on the next day, I was able to load my so-called lost-cause backup database and pull files off. It worked so well that I pulled off 2 entire partitions and have since imaged them to DVDs.

In the process of reverse engineering the DB format I decided to document everything, because it seemed to me that this knowledge could be valuable to others who might get into similar circumstances. Also, I had the thought of developers making applications that work directly with the WHS Backup DB, using my unofficial spec. as the starting point.

One of the reasons that this blog exists is so that I can post my original reverse engineered spec. to a more permanent place. It was originally available in this forum post.

It's now been just over 2 weeks since my catastrophe. I've figured out most of the database format. I've made my first tool based on my spec. (this is coming up shortly). So what follows will be the original spec. edited and updated for correctness.

Sunday
Jul 27, 2008

Server Recovery Data Loss - Backup Corruption

So the neat thing about the Windows Home Server (knowing Microsoft, that link will probably be dead by the time you read this) is that it pools all your hard drives into this one massive storage pool, it's easily expandable and provides flexible folder-level RAID 1 like data redundancy. It accomplishes this feat by automagically shuffling your files around for you to all the different drives that are part of the storage pool as it sees fit.

One aspect of this balancing process is that it tries to keep all your data off the main system drive. The rationale for this is that if the main system drive fails you can do a non-destructive server recovery process where it re-installs a brand new copy of the OS to the system drive, and since your actual data files are elsewhere, it will create the proper tombstone links from the data portion of the system drive to your actual files that live elsewhere.

Good theory.

So to continue my story from where we left off last, I was about to enter this server recovery process. By then I had pulled the dying system drive, checked it for bad sectors, and imaged what was left of it, just to have something to fall back on in case everything went horribly wrong from here on. By the way, it turns out that it had bad sectors all over the beginning of the drive, so the OS was definitely on its last legs.

Now that I was able to get the server recovery wizard to actually find the server, I proceeded to click the server recovery option (versus Factory Reset, which implies data loss). The wizard let me know that it would try to recover all my files. Since I had verified earlier that all my files were actually on the non-failing drives, I had high hopes.

An hour or so later...

Server recovery was apparently successful. After going through the standard initial setup I had what appeared to be a functioning server. Even all my shares and data appeared to be there. After opening a few files and getting back what I expected, it looked like server recovery might have actually worked. Imagine that!

I proceeded to install the connector software on my main desktop. Right after that, I noticed something horribly wrong. The backup service was not running and claimed that my backup database was corrupt! How can it be corrupt, it was fine before the server restore? Obviously the server restore process is selective about the files it recovers, and in this case, it chose not to restore some of my backup database.

After digging through the log files I noted that it was complaining about a missing file named Control.4096.dat, and that it was being referenced by Data.4096.NN.dat files.

Not good.

Data loss is never good, but this was a 260GB+ backup database that holds data from some hard drives which are no longer in use. In effect it became my archive. I realize that it was never meant to be used that way, but that's what happens naturally over time, and data loss should never be an expected and accepted outcome. We should always strive to prevent data loss.

Now that I realized some of my files were missing, I went on to load the image I had made earlier of the failing drive. Since it contained a list of all my files (in tombstone link form), I could compare the files in the current backup database with those in the original working backup database.

Incidentally, the backup database is located in D:\folders\{00008086-058D-4C89-AB57-A7F909A47AB4} on the Windows Home Server. I recommend you don't touch it unless you know what you are doing (or you're like me, who has too much time on his hands).

After comparing the file lists, I quickly discovered that 2 files were missing, not 1:

Control.4096.dat


Data.4096.0.dat
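The comparison itself is simple enough to script. Here is a rough sketch in Python, not the exact procedure I followed: it lists the names in a reference folder (standing in for the tombstone image) that are absent from a current folder. The function name `missing_files` is made up, the demo uses throwaway temp directories rather than the real database folder, and `Data.4096.1.dat` is a hypothetical name that merely follows the Data.4096.NN.dat pattern. Names are compared case-insensitively, since that is how Windows treats them.

```python
import tempfile
from pathlib import Path


def missing_files(reference_dir, current_dir):
    """Names found directly under reference_dir but not under current_dir.

    Only the top level of each folder is examined, and the comparison
    ignores case.
    """
    current = {p.name.lower() for p in Path(current_dir).iterdir()}
    return sorted(
        p.name for p in Path(reference_dir).iterdir()
        if p.name.lower() not in current
    )


# Demo with two temporary folders standing in for the old and new listings.
with tempfile.TemporaryDirectory() as ref, tempfile.TemporaryDirectory() as cur:
    for name in ("Control.4096.dat", "Data.4096.0.dat", "Data.4096.1.dat"):
        (Path(ref) / name).touch()
    (Path(cur) / "Data.4096.1.dat").touch()
    print(missing_files(ref, cur))
# prints: ['Control.4096.dat', 'Data.4096.0.dat']
```

Because only names are compared, this works even when the reference side holds nothing but tombstone links.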

Clearly, the backup database integrity check that the Windows Home Server was doing was not up to snuff. It didn't notice that a whole big ol' 4GB file was missing!
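To give an idea of the kind of check I would have expected, here is a small Python sketch that looks for gaps in the numbered data files. It rests on an assumption I can't confirm from any official documentation: that files named like Data.4096.NN.dat are numbered contiguously starting at 0. The names `DATA_NAME` and `missing_indices` are invented for the example, and this is certainly not how the server's own integrity check works.

```python
import re

# Assumed naming pattern: Data.4096.<number>.dat
DATA_NAME = re.compile(r"^Data\.4096\.(\d+)\.dat$", re.IGNORECASE)


def missing_indices(names):
    """Return the indices missing from 0..max among Data.4096.NN.dat names.

    Assumes numbering starts at 0 with no intentional gaps. Names that do
    not match the pattern are ignored.
    """
    present = {int(m.group(1)) for n in names if (m := DATA_NAME.match(n))}
    if not present:
        return []
    return [i for i in range(max(present) + 1) if i not in present]


# Hypothetical listing in which the first data file has gone missing.
listing = ["Control.4096.dat", "Data.4096.1.dat", "Data.4096.2.dat"]
print(missing_indices(listing))
# prints: [0]
```

A check like this can't tell if files are missing from the end of the sequence, since it only knows the highest index still present.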

After pulling my data drives and running an undelete recovery tool, I found that the files in question had been deleted and conveniently overwritten with other data, and were thus unrecoverable. The first file, Control.4096.dat, had this done to it twice, on two separate data drives. Nice.

So me being me, unable to accept the inevitability of things, I proceeded to think about how I could recover this data. After going through all the sanctioned possibilities, of which there are none really, short of resetting your backup database, I came to the conclusion that I had to try something else.

By the way, can you believe people are actually accepting this as a solution? My backup database is corrupt, umm... erase the whole thing and start over. Can you imagine this in the real world, I have a leaky roof, what should I do? Umm... the house is built in such a way sir where the entire roof is an integral part of the sub-structure and attempting repair on any part of it is impossible. Therefore, unfortunately we will have to wreck your entire house and rebuild it from scratch. Unfortunately, we can't let you enter the house first to retrieve your belongings because the whole thing is bound to collapse on you at any moment.

That's ludicrous!

So I started looking for alternative solutions. I read the WHS technical brief on the topic and I came across this excellent effort by brubber on the official WHS forum (all links are subject to change at Microsoft's whimsy, so don't bet on them working for long). While this was all very useful, it didn't really give me any options I could use.

What I really wanted was to reconstruct that Control.4096.dat file. From looking at the tombstone image I could tell it was only 4KB in size. I had a good chance of getting at most of my data, albeit incomplete, if I could just get that Control file back. Since the backup engine never noticed the gigantic data file missing, I thought maybe it didn't really need it. It looked like the first data file of the whole backup set, and I didn't really care about the first backups I ever did, it's the stuff in the middle that I wanted to recover.

I couldn't believe that a 4KB file was preventing me from accessing 260GB!

Now I don't normally do this, but I think this warranted it, I posted on the official WHS forum for help. I wanted to avoid the standard responses of "did you read the FAQ?" and such, so I got right to the point. Among other things, I asked "Is it possible to get the format of this file or some sort of rebuilding process?".

Sure enough, the first response I got, from ColinWH, although not very encouraging, was very useful. I got a snippet from this supposedly all-important control file and it looked like XML! I rejoiced: surely I could recreate this file.

2 weeks later...

No, not really :) But this was only the beginning.

Sunday
Jul 27, 2008

MediaSmart Server Recovery Failure

Here we go, my first blog. So let me get right to the point. About 2 weeks ago I started losing the primary hard drive on my HP MediaSmart Windows Home Server. First I started getting worrying messages from the WHS console. I then proceeded to check the event log on the server itself, and sure enough some really nasty event IDs showed up, which, after googling, proved to be sure signs of impending hard drive melting doom.

I researched my recovery options and it seemed that the HP server had a built-in facility to do complete non-destructive recovery from such a scenario. All I needed was a new hard drive, a paper clip and another computer with a CD drive. Sounded simple enough.

So I proceeded to pull the old system drive and replace it with a new drive. I started the server recovery CD from another system on the network, followed the instructions given to me, got out that trusty paper clip and hit the hidden server recovery button on the front of the system at just the right time, and then... nothing! The server recovery wizard went through its "finding your home server" bit for about 2 minutes and then proceeded to tell me that it couldn't find anything and suggested that I turn off my firewall. Clearly, that was not the issue.

So, 6 hours later...

After going through multiple routers, multiple ethernet cables, but still the original paper clip, I came to the conclusion that there must be something seriously wrong with the server's BIOS, or whatever process it uses for server recovery. Connecting the server to a router and booting into this "safe recovery environment" didn't produce a single blip on the router's lights. I assumed the worst and was prepared to call HP for a replacement. It was really late at night, so I saved that task for the next day.

However, come next day, relentless as I am, I continued my search for the possible cause of this recovery mode malfunction.

A few hours later...

I came across this post on some obscure forum, which I've since lost, from another person having a similar experience with server recovery. Someone had replied telling them to make sure that the server only sees one OS, that it had trouble booting up with a multi-boot system, and that they should unplug any external USB hard drives not part of the storage pool.

Now I don't have a multi-boot system, and why you would want to multi-boot when you're doing server recovery is completely beyond me. But I DO have an external USB hard drive that was not part of the storage pool. Although, how the server knows if a hard drive is part of the storage pool or not at this early stage in booting is also completely beyond me. But nevertheless, this was one thing I hadn't tried. So I proceeded to unplug my single USB hard drive and fire up the wizard again, got out that trusty paper clip one more time, and sure enough, after 2 minutes of searching, the wizard reported that it had found my home server. Hooray!

Lesson learned.

I wish they had mentioned somewhere that server recovery DOES NOT WORK WITH EXTERNAL USB HARD DRIVES PLUGGED IN.

But little did I know, my problems had only started. Because at this point, I hadn't actually lost any data yet.