Sunday, July 28, 2013

Storage Pools, Dynamic disks, and software RAID experiences

A quick disclaimer - I am nowhere near an expert on Windows Server or on server equipment.  As I have with programming, I'm learning and struggling on my own - the best way to learn, in my opinion.  With that, I know I am not using recommended equipment or recommended practices, so some of this may be my own fault, but if I'm out there trying to do this, there will be others as well - hopefully this'll help someone out there, or help prevent others from wasting as much time as I have on this...  But, as with anything else, your experiences may vary and I am not responsible for any data loss - this is what worked for me, and it may or may not help you.

As with many of us in the digital age, I have a TON of data that I want to keep and make sure doesn't get corrupted or lost through equipment failure.  In my case, this was over 1TB of WinRAR files of personal file backups (from files dating back to 1993), 500GB of pics and videos, plus some TiVo files, etc.  On top of being a digital pack rat, I also hang onto computer equipment and gather equipment as I come across it.  So, two of my prior computers were not being used and I had a ton of spare [i.e. not being used] SATA hard drives, so I decided to build two file servers from these pieces.

The first system was based on an Intel 975XBX motherboard.  Since I planned on loading this thing with SATA drives, I decided to get a cheap server case, and my first choice was the Rosewill RSV-L4500 (http://www.newegg.com/Product/Product.aspx?Item=N82E16811147164).  This case is a beast with 15 internal drive bays.  The case worked ok except when drives failed or when I wanted to upgrade drives - I had to open it up, find the failed drive (harder than it sounds with no indicator lights showing which drive failed), move the internal fan bar, and replace the drive - anyway, it was a pain to replace drives in this case.  Good case, keeps the drives cool, but it would be easier with hot swap drive bays.  In addition to the motherboard, I added two 4-port SATA controllers, but I did not specifically go after RAID controllers - I wanted cheap, with good ratings, with 4 SATA II+ ports.  One of the controllers, a Koutech IO-PSA420 (http://www.newegg.com/Product/Product.aspx?Item=N82E16816104006), did include RAID support.  One of the motherboard's controllers supported RAID as well.

The second system was based on an Intel 975XBX2 motherboard (very similar to the prior one).  Since I had learned my lesson about needing hot swap bays, I went after a similar Rosewill case, the RSV-L4411 (http://www.newegg.com/Product/Product.aspx?Item=N82E16811147165).  Again, I had to get two additional 4-port SATA controllers - same story as before, none were specifically RAID - but this time I went after SATA III controllers.  I bought two of the Syba SI-PEX40064 controllers (http://www.newegg.com/Product/Product.aspx?Item=N82E16816124064).  At least one of the controllers on the motherboard included RAID support.

Now, for the drives - they ranged from 60GB to 1TB in size, a few being enterprise level drives, some being notebook drives (the 1TB ones were taken out of new systems in favor of 120GB SSDs - yes, SSDs are worth it that much!).  Wide mix of brands and sizes.  I had more than enough drives to completely fill both servers - 27 drives with about 5TB of storage space in total - all individual drives.

Initially these systems had Windows Server 2012 Essentials loaded on them.  While it was nice and it worked (kinda - domain issues with Essentials), I was not fond of the split usage of the drives - a backup share, for instance, was spread across several drives because no single drive had enough free space to hold the entire folder.  I looked at ways to resolve this - my first discovery was data deduplication, which is a new feature in Windows Server 2012 but, of course, was not available in Essentials.  Data deduplication claimed space savings as high as 75% or more...so in theory a 1TB drive could now hold up to 4TB worth of data?  I figured it was worth the upgrade risk...
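
As a back-of-the-envelope check on what a savings percentage means for capacity, here's a minimal Python sketch.  It assumes the savings rate applies evenly to everything on the drive (real results depend heavily on the data), and the function name is just something made up for illustration - it is not part of any Windows tool.

```python
def effective_capacity_gb(physical_gb: float, savings_rate: float) -> float:
    """Amount of logical data that fits when dedup saves `savings_rate` of the space.

    A savings rate of 0.75 means the data occupies only 25% of its original
    size, so the drive holds physical_gb / 0.25 worth of files.
    """
    if not 0 <= savings_rate < 1:
        raise ValueError("savings_rate must be at least 0 and less than 1")
    return physical_gb / (1 - savings_rate)


# A 1TB (1000GB) drive at the advertised 75% savings
print(effective_capacity_gb(1000, 0.75))  # 4000.0
```

So the best case is a lot better than "a bit more room", but only if your data actually dedupes that well.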

So, after the upgrade, I noticed the Storage Spaces feature, which is also new to Windows Server 2012 (and available in Windows 8).  I read up on it and essentially it is a software RAID, but with a bit more control.  I took all my individual drives, added them to a mirrored pool, and copied data to the pool.  While I had a nice big drive, and it seemed to be fairly safe, my transfer rates were horrible - I'm used to 30-60MB/sec transfer rates and I was getting 10MB/sec or less, often with minutes at 0MB/sec.

After the failed Storage Spaces attempt, I tried data deduplication, which, in theory, would help with my space issues.  I turned it on for several of my larger drives, ran a PowerShell command to force it to deduplicate the data, and later found I had 300GB extra on a 500GB drive - nice.  Then I tried moving files and hit a problem similar to one I'd had with Storage Spaces - remove a 50GB file and I get back 50K of space?  Yes, data deduplication worked, but with that type of recovery after a delete, and with the share still split across drives, I still needed to combine drives.

So, now, RAID...something I've tried before and never had good success with, but I figured it was worth another shot.  Since I had few hardware RAID controllers, I figured I'd try software RAID - via Disk Management in Computer Management.  Software RAID in Windows is done via dynamic disks - the disks can be spanned (JBOD), striped, mirrored, or RAID 5...there is no option for RAID 6, 10, etc.

On the hot swap server, I set up two 1TB mirrors (one on a pair of WD Green 1TB drives, the other on a pair of Samsung 1TB laptop drives).  I moved all my data from my individual drives onto one of the two mirrors (one for the file backups, the other for the pics, etc).  The performance was poor, again with lower than expected transfer rates (this is a mirror, so I was expecting it to be somewhat slower).  I read that RAID 5 and RAID 10 were better solutions for performance, so I decided to go that route.

After over a day of copying, I set up a software RAID 5 array with three 500GB hard drives.  My plan was to put the data there and then switch the two software-mirrored pairs of 1TB drives to RAID 10.
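
For anyone planning a similar shuffle, here's a rough Python sketch of how much usable space each layout gives you.  These are the simplified textbook rules (every drive contributes only as much as the smallest drive in the set), and the level names and function name are my own labels for illustration, not anything Windows or a RAID controller exposes.

```python
def usable_capacity_gb(level: str, drive_sizes_gb: list[int]) -> int:
    """Approximate usable capacity for a set of drives under a given layout."""
    if not drive_sizes_gb:
        raise ValueError("need at least one drive")
    n = len(drive_sizes_gb)
    smallest = min(drive_sizes_gb)

    if level == "spanned":            # JBOD: drives are simply concatenated
        return sum(drive_sizes_gb)
    if level == "striped":            # RAID 0: no redundancy
        if n < 2:
            raise ValueError("striping needs at least 2 drives")
        return n * smallest
    if level == "mirrored":           # RAID 1: two copies of everything
        if n != 2:
            raise ValueError("this sketch assumes a 2-drive mirror")
        return smallest
    if level == "raid5":              # one drive's worth of space goes to parity
        if n < 3:
            raise ValueError("RAID 5 needs at least 3 drives")
        return (n - 1) * smallest
    if level == "raid10":             # striped mirrors: half the drives hold copies
        if n < 4 or n % 2:
            raise ValueError("RAID 10 needs an even number of drives, 4 or more")
        return (n // 2) * smallest
    raise ValueError(f"unknown level: {level}")


print(usable_capacity_gb("mirrored", [1000, 1000]))              # 1000
print(usable_capacity_gb("raid5", [500, 500, 500]))              # 1000
print(usable_capacity_gb("raid10", [1000, 1000, 1000, 1000]))    # 2000
```

That lines up with my setup: three 500GB drives in RAID 5 give about 1TB, and four 1TB drives in RAID 10 give 2TB.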

Performance of the RAID 5 was better than the mirror but still pretty sluggish at times (as in KB/sec) - but at other times I was able to get 150MB/sec, which is pretty good compared to what I had been seeing.  I did not enable data deduplication on these drives as there was a clear warning to not do that on a dynamic volume - if the dedupe database got corrupted, it was possible to lose all data!

Once all the data was copied onto the RAID 5 array, I rebooted and set up the hardware RAID 10 array using the 4 1TB hard drives.

Upon starting up Windows again, the RAID 10 array showed up as a 2TB drive - as expected.  I figured I'd get the data off the RAID 5 onto that single logical drive, but there was one major problem - the RAID 5 showed as "failed."  The good thing about RAID 5, in theory, is that if one drive drops out of the array or goes bad, the array can rebuild itself and recover.  So, I figured I'd reactivate or try to rebuild the array, but everything I tried that was suggested on Google and by Microsoft failed.  I was not able to get access to the data on the drive or even get the drives to rebuild.  I tried switching the drives to a different controller (with software RAID, Windows keeps track of the drives that are in the array) but still no luck.  I even tried a few suggested software packages that were rated high for recovering data from corrupted RAID 5 volumes, and after a week and a half of scanning with products such as Zero Assumption Recovery and ReclaiMe Free RAID Recovery, the most valid data I was able to get back was four 50K files and 450MB of very corrupted RAR files.  No luck.
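
For the curious, the "rebuild itself" part of RAID 5 comes down to XOR parity.  Here's a minimal Python sketch of the idea for a single stripe on a 3-drive array (two data blocks plus one parity block).  It is only an illustration of the principle - a real array rotates parity across the drives and works on much larger blocks, and this is not how Windows implements it internally.

```python
from functools import reduce


def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equal-length byte blocks together, byte by byte."""
    if len({len(b) for b in blocks}) != 1:
        raise ValueError("blocks must all be the same length")
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


# One stripe: two data blocks and the parity computed from them
drive0 = b"backup-0"
drive1 = b"photos-1"
parity = xor_blocks(drive0, drive1)

# Pretend drive1 died: XOR the survivors together to get its contents back
rebuilt = xor_blocks(drive0, parity)
print(rebuilt == drive1)  # True
```

Which is why losing one drive should be survivable - and why it was so frustrating that my array showed as "failed" with all three drives present.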

Figured I'd pull the RAID 5 drives out, back them up at the sector level, and put them into the other Windows Server to see if I could import the array and get the data that way.  During this, I also checked the drives and all were perfectly fine sector wise.  But, one of my favorite products for partition/backup of drives, Paragon Hard Disk Manager 12, decided that it would not do a sector backup of any of the drives - never gave a good enough error message to determine why.  At this point, I figured I'd already lost the 1TB of data so I was willing to try anything.

I put the failed drives into my other Windows Server and imported the foreign disks.  It showed the correct number of drives for the array (3), as expected.  I clicked ok and it came up with the size I was expecting as well - about 1TB - but the condition was "data incomplete."  Figured, ok, this may work - except the next message gave me cause for concern - "some of the volumes you are importing will lose data because you have not moved all your disks to this system.  Are you sure you want to continue?"  Um - WHAT? I imported all the drives! Can't do a sector by sector backup and now Windows tells me this.  I did some research online and people said this message was a bug and that it was "safe" to continue.  Figured I'd go with the general crowd sourced info and go for it - clicked ok and now I was right back where I started with the prior server - the entire RAID 5 showing as "failed."  Like on the prior server, I tried reactivating the disks (all of which showed as online).  I kept getting "The attempted operation cannot be completed.  The selected volume is offline."  I eventually found some documentation on the Windows diskpart command (a command-line counterpart to Disk Management, with a few more capabilities).  According to diskpart, everything should have been good and the volume was ready (it didn't even show it as failed)...but I still could not repair it or get to the data.

Then I came across a utility similar to diskpart that had directory and copy capabilities and was written in C# - all pluses, as I just wanted the data off the drive.  The tool is called RAID5Manager and was written by Tal Aloni - (http://iknowu.dnsalias.com/files/public/RAID5Manager/Raid5Manager.htm).  I tried the utility and was able to access the array without any issues...and doing a "dir" command, I was able to see directories!  Changing to the various directories, I was also able to see the files I expected to see.  I figured this was too good to be true, so I copied a small file to another drive, opened it up, and it worked!  I tried a larger RAR file, since RAR has built-in consistency checks, and to my surprise WinRAR opened the file without complaint!  The fun part was that the utility only copies one file at a time - no wildcards, etc. - but since the utility was written in C#, it wasn't hard for me to add that myself.  After about 5 minutes of code changes, I redeployed the updated utility to the server, did a copy, and all files went to the backup drive without any issues.  I checked many files and all were intact!  This utility saved me from losing 1TB of backup data to a "failed" RAID 5 array - the thing that is supposed to provide data redundancy - that, according to diskpart, was not really failed.  I'm not sure why or how this array got into this state (nor how to get it out, except to blow it away), but this tool helped recover all the data.
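
The wildcard change is simple in concept: match each name in the directory listing against the pattern and call the existing single-file copy for every hit.  Here's a rough Python sketch of that idea (my actual change was in C# inside the tool and is not shown here).  The `copy_matching` and `copy_one` names are hypothetical stand-ins, not part of RAID5Manager, and the matching is case-insensitive to mimic Windows.

```python
import fnmatch
from typing import Callable, Iterable


def copy_matching(names: Iterable[str], pattern: str,
                  copy_one: Callable[[str], None]) -> list[str]:
    """Call copy_one for every name matching a wildcard pattern like *.rar."""
    copied = []
    for name in names:
        # Lowercase both sides so matching ignores case on any platform
        if fnmatch.fnmatchcase(name.lower(), pattern.lower()):
            copy_one(name)
            copied.append(name)
    return copied


# Stand-in for a directory listing and a single-file copy routine
listing = ["backup1993.rar", "Backup2013.RAR", "notes.txt"]
done = copy_matching(listing, "*.rar", lambda name: print("copying", name))
print(done)  # ['backup1993.rar', 'Backup2013.RAR']
```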

So, lesson learned: while I got "lucky" this time and was able to get my data back, ALWAYS back up your data in multiple places - even things such as RAID, which should recover data in case of failure, will fail - the question is just "when."  In my case, I believe a Windows Server bug bit me hard on this failed RAID 5 volume, but I have no proof either way.
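
If you do keep copies in multiple places, it's worth checking that the copies actually match.  Here's a minimal Python sketch that compares two folder trees by SHA-256 hash.  The function names are made up for illustration, and it assumes both locations are reachable as ordinary folders.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in chunks so large backups don't have to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def mismatched_files(primary: Path, backup: Path) -> list[str]:
    """Relative paths of files under primary that are missing or differ in backup."""
    problems = []
    for src in sorted(p for p in primary.rglob("*") if p.is_file()):
        rel = src.relative_to(primary)
        dst = backup / rel
        if not dst.is_file() or sha256_of(src) != sha256_of(dst):
            problems.append(rel.as_posix())
    return problems
```

An empty list from `mismatched_files` means every file in the first folder has an identical twin in the second; anything listed is either missing or different.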

To sum up my experiences:
1.  Storage pools - not a good idea, especially if you'll be moving files or need performance (e.g. a media server).
2.  Dynamic disks - good concept but RAID 5 seems buggy.  Performance was ok but not consistent.
3.  Data deduplication - good concept/theory, but do not use on a RAID/dynamic volume!  Great returns for data that is rarely used (e.g. backups, archives, etc) but not great for those who move files often (deleting a 50GB file does not mean you get 50GB back - you could recover very little).

All the suggestions out there say that if you need data redundancy, use hardware RAID and RAID compliant drives - which means more money.

Anyway, hope this helps someone out there, especially with getting data off a failed dynamic RAID 5 volume.