In 2008 I set up a NAS with 4x750GB drives in a RAID-5 array. Back then such a setup was pretty awesome (and expensive). I used it to store my digital family pictures. Such an awesome setup with RAID-5 and a cold spare (just in case) made me feel safe about my data. I felt so safe that I totally forgot to make sure all files were also backed up somewhere else. Then in 2012 disaster struck, and disaster looked like this:
The Promise FastTrak TX4310 RAID (actually a fake-raid) controller is showing that the 2TB array is “offline”. It should have gone “critical” first when the first drive failed, but for some reason it didn’t (or it did and nobody noticed). Now two drives have failed and the array has gone “offline”. So, what to do next?
First rule of data recovery: don’t panic!
Second rule: do not write anything to the disks.
Repairing the server
After not panicking I decided to create a new array with a set of fresh drives and take my loss. I removed the controller and the four 750GB drives and connected two 3TB drives directly to the motherboard. The NAS had been running Windows (don’t ask why) until then, and at that moment I switched to Linux software RAID (using mdadm). This was fairly easy to set up and it would even automatically email me in case of a failure. Great, but a little too late.
I stored the drives (together with the controller) in a box. I made sure I marked the drives with the corresponding port numbers before I disconnected them. Storing the drives allowed me to decide later whether or not the missing data was worth recovering. Last week (2015, three years after the crash) I decided I would actually give recovery a try.
Analyzing the problem(s)
I had never done any RAID data recovery before, so I was expecting a rough journey and it sure took a lot of time (2 weeks, every evening). First I connected all drives individually to the second SATA port of the motherboard of my Linux box. Then I read the SMART status of the drives. The SMART information will tell you whether or not a drive is healthy. I found that three of the four drives were actually healthy and one (drive 2) did not get recognized at all.
The drive was not even ticking, which most crashed drives do. The drive actually did not make any sound at all. This led me to believe that the PCB might have blown up. I checked, but unfortunately the crashed drive did not have the exact same part number as the spare drive. Otherwise I might have tried swapping the PCB from the spare to the broken drive.
The controller also did not recognize all the drives. The second one was “missing” (which was expected) and the fourth was reported as “free” (which is strange). After some investigation I found out that this most probably meant that the meta information on the position of the fourth drive in the array had been corrupted. This meta information is stored in a so-called superblock, which is stored at the end of the drive.
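The SMART check mentioned earlier in this section can be scripted. This is a minimal sketch, assuming the smartmontools package is installed and the script runs as root. Which tool was used for the readings in this post is not stated, and the function names and the matched output line (the one smartctl prints for ATA drives) are my assumptions.

```python
import subprocess

def parse_health(smartctl_output):
    """Extract the overall health verdict from `smartctl -H` text, or None."""
    for line in smartctl_output.splitlines():
        # Assumed wording of the ATA health line printed by smartctl.
        if "overall-health self-assessment test result:" in line:
            return line.split(":", 1)[1].strip()
    return None

def smart_health(device):
    """Run `smartctl -H` on a device (needs smartmontools and root)."""
    result = subprocess.run(["smartctl", "-H", device],
                            capture_output=True, text=True)
    return parse_health(result.stdout)
```

Calling smart_health("/dev/sdb") (a hypothetical device name) returns None when the drive is not recognized at all, which is what happened with drive 2.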
Imaging the drives
Before doing anything else I decided I needed disk images. The good SMART readings made me optimistic about the chance of successfully imaging 3 out of the 4 drives. So I bought a big 6TB drive and started imaging the drives one by one. I recommend that you do not simply use “dd”, but that you use the enhanced “ddrescue” tool instead. It allows for retrying and skipping bad blocks. Also, it stores its progress in a log file, so that it can continue where it left off after an interruption.
After I imaged all three working drives it turned out that one of the drives (the one that was reported as “free”) had some bad blocks. Nevertheless “ddrescue” was able to make an image of the drive. It just took a little longer.
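To make the imaging step concrete, here is a minimal sketch that builds the command lines for a two-pass GNU ddrescue run. The two-pass pattern (a fast first pass, then retries on the bad areas) is a common approach and not necessarily what I typed at the time. The device name and paths are hypothetical.

```python
import shlex

def ddrescue_commands(device, image, mapfile, retries=3):
    """Return the two ddrescue invocations for imaging one drive."""
    # Pass 1: copy everything that reads easily and skip the slow
    # scraping phase (-n), recording progress in the mapfile.
    first = ["ddrescue", "-n", device, image, mapfile]
    # Pass 2: reuse the same mapfile to revisit only the bad areas,
    # with direct disc access (-d) and a limited number of retries (-r).
    second = ["ddrescue", "-d", f"-r{retries}", device, image, mapfile]
    return [first, second]

# Print the commands instead of running them, so nothing touches a disk.
for cmd in ddrescue_commands("/dev/sdb", "/mnt/big/disk1.img", "/mnt/big/disk1.map"):
    print(shlex.join(cmd))
```

Because both passes share one mapfile, an interrupted run can simply be started again with the same arguments.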
I decided to look at the end of each drive using “dd” with a “skip” parameter and piping the output through “hd” (hex dump). I found that at the end of the images there was indeed a superblock. The one from disk 4 actually looked different from the ones from disk 1 and 3. Now all I had to do was recreate the superblock and write it to the disk. After that I expected the array to return to the “critical” state (one drive missing). In this state the array should be readable.
After imaging I decided it was time to take some risk and break rule 2 (don’t write to the disk). I tried many things to recreate the superblock. I tried to use “ghex” and repair the superblock by hand (only 8 bytes were different between the superblocks of disk 1 and 3). I tried recreating the array WITHOUT INITIALIZATION, so that it would only write new superblocks. This did not work either, neither using the Promise BIOS nor using the WebPAM software from Windows. I guess this method did not get the array’s RAID parameters exactly right.
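The same inspection can be done on the image files without “dd”. This is a minimal sketch: it reads the last bytes of an image, prints them as a hex dump and lists the offsets at which two tails differ. The 512-byte length is an assumption (I do not know the exact size of the Promise superblock) and the file names in the usage note are hypothetical.

```python
def read_tail(path, length=512):
    """Return the last `length` bytes of a file (fewer if it is shorter)."""
    with open(path, "rb") as f:
        f.seek(0, 2)                      # jump to the end to learn the size
        size = f.tell()
        f.seek(max(0, size - length))     # step back `length` bytes
        return f.read()

def hexdump(data, base=0):
    """Format bytes as offset, hex columns and printable ASCII."""
    lines = []
    for off in range(0, len(data), 16):
        chunk = data[off:off + 16]
        hexpart = " ".join(f"{b:02x}" for b in chunk)
        text = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append(f"{base + off:08x}  {hexpart:<47}  |{text}|")
    return "\n".join(lines)

def diff_offsets(a, b):
    """Offsets at which two byte strings differ (compared up to the shorter one)."""
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]
```

For example, diff_offsets(read_tail("disk1.img"), read_tail("disk4.img")) lists where the superblock of disk 4 deviates from the one of disk 1.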
ReclaiMe to the rescue
Then I read a positive review on the web about some proprietary (but free) Windows RAID recovery software. In order to be able to run it I created a Windows 7 VM using KVM on my Linux box and attached the images to it using the SATA driver. Then I installed “ReclaiMe Free RAID Recovery” from www.freeraidrecovery.com and gave it a try. I was skeptical, but I should not have been. After some extensive searching on the disks the software found a RAID-5 array with a missing drive. That was music to my ears!
“ReclaiMe Free RAID Recovery” gave me the option to recreate the array on a new disk. I quickly created a sparse 2TB image on my 6TB drive and added it as another drive to the VM. Then it took the software 40 hours to recreate the array into this image. But after that, even without a reboot, Windows identified the NTFS partition. I was able to access all my data again. I cannot explain how happy and amazed I was. I powered off the VM, loop-mounted the image on my Linux box using “kpartx” and was able to copy everything to the new Linux NAS server.
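Why a RAID-5 array with one missing drive is still recoverable can be shown in a few lines. In RAID-5 the parity block of each stripe is the XOR of the data blocks in that stripe, so any single missing block equals the XOR of all the others. This is only an illustrative sketch with made-up data. It ignores block size, drive order and parity rotation, which are exactly the parameters the recovery software had to detect.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equally sized byte blocks together, byte by byte."""
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))

# One stripe of a 4-drive RAID-5: three data blocks plus one parity block.
d1, d2, d3 = b"fami", b"ly p", b"ics!"
parity = xor_blocks([d1, d2, d3])

# Drive 2 is gone: rebuild its block from the three surviving blocks.
rebuilt = xor_blocks([d1, d3, parity])
print(rebuilt == d2)
```

With two drives missing there is not enough information left in a stripe, which is why the array only came back once three readable images were available.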
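The sparse image mentioned above is a file with a large apparent size that only takes up disk space once data is written to it. This is a minimal sketch of how one could be created from Python, assuming a Linux file system that supports sparse files. The function names are mine and the path in the usage note is hypothetical.

```python
import os

def create_sparse_image(path, size_bytes):
    """Create a file of the given apparent size without writing any data."""
    with open(path, "wb") as f:
        f.truncate(size_bytes)   # extends the file; no data blocks are written

def allocated_bytes(path):
    """Bytes actually allocated on disk (st_blocks counts 512-byte units on Linux)."""
    return os.stat(path).st_blocks * 512
```

For example, create_sparse_image("/mnt/big/array.img", 2 * 10**12) would produce a file that reports a size of 2TB while allocated_bytes() stays close to zero until the recovery software starts filling it.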
I recovered the picture below (and thousands of others).
This particular picture shows me (right) and my twin brother (left) behind my PC (web-cam shot from 28th of December 2000).
Disclaimer / Warning
I do NOT recommend attempting data recovery without any experience. There is a fair chance that you will make a mistake. If you accidentally write to the (original) disks you may lose the data forever, so be careful. That said, if you can actually copy the disks to images and/or new disks, then you have some freedom to experiment. If you are really lucky and the disks are not (severely) damaged then you may even be successful, just like I was.