RAID5-Server to hold all your data — the NAS alternative with software RAID

This was an old project of mine. A few years ago I had a huge load of data coming in (~4 TiB) and the amount of storage I needed suddenly more than doubled. Until then I had been using two 1.5 TB HDDs which I mirrored by hand using rsync, because I’m paranoid about losing data. It was annoying to always copy all data to each disk to have redundancy — and certainly not a smart solution. Now that data wouldn’t fit onto the two disks anyway, so it was time to think of a new solution. I had had enough of wasting my time copying files from one hard drive to another.

I had always thought about buying a NAS to handle all my data, one with four or more slots for 3.5″ disks, to solve the storage problem once and for all. But those systems are expensive — a few hundred Euros for a box with only four slots and no hard drives included! Four 1.5 TB HDDs (the biggest on the market at that time) would have cost me another 4 × 80 €. I didn’t have that money. But that was only one reason. I also disliked the fact that you don’t exactly know the configuration and that you are limited to the functions the vendor provides. Sure, you do get a modified version of GNU/Linux that even allows you to choose between a few RAID configurations. But I didn’t want some crippled and preconfigured version of Debian and only the choice between RAID1/RAID5. I also had an eye on the Drobo. But besides being incredibly expensive, it didn’t support GNU/Linux (back then) and I didn’t really trust the system. A NAS that automatically distributes the data between all HDDs (no matter what size) you put in there, while keeping redundancy at all times, sounds like the perfect storage device. But since I didn’t exactly know how it works, I didn’t trust it. I prefer to understand the things I use (and try to stick with the KISS principle when I can). In the end the best solution for me was an unused desktop PC, software RAID and of course Arch Linux (which uses rolling releases).

That way I

  • only had to pay for the hard drives,
  • had all the choices about the software RAID setup,
  • could KISS,
  • could use my favourite GNU/Linux distribution which is always up to date,
  • could recycle an unused PC that otherwise would have spent its days collecting dust in a corner.

The downside of this approach is of course the long configuration time, although the process is pretty straightforward (depending on your skill, you can get the basic layout up and running in a few hours if you have all the hardware at hand). Another drawback to consider is the power consumption. To be honest, I don’t know how much power an equivalent NAS device with a 6-disk array consumes — mine needs 92 W idle and 130 W under load, which isn’t even very much, I guess. However, a NAS device with roughly the same configuration probably needs less (and is smaller in size).

I know, a lot of people still disapprove of software RAID, but good hardware RAID controllers cost (lots of) money, and if the chip dies you have to buy exactly the same one to access your data again — in other words, you are dependent on the vendor. On the other hand, people claim that software RAID isn’t as stable as a hardware-controlled setup and stresses the CPU. But with cheap “hardware RAID” controllers you can face the same problem. These fake-RAID controllers don’t have a parity unit, thus leaving all parity calculations to the main processor, which practically is software RAID. I’m no expert; all I can tell is that I have been running several software RAIDs for many years now and never had any trouble. All single drives that died in an array so far could easily be replaced and resynced without hassle (and without downtime). This isn’t really a hack or mod; basically it’s just a self-made NAS with more configuration freedom, using basic GNU/Linux tools.

You have all the possibilities RAID offers: you can grow/shrink an array (gaining/reducing space) by increasing/decreasing the number of drives or by replacing each drive with one of bigger/smaller size. You can swap a drive and resync while the data is still accessible — everything can be done online (given that you have hot-swappable hardware). It’s also possible to add a spare disk, which in case of a disk failure is used immediately to restore parity. Before you start, think about what kind of data you want this for! If you plan to run a database (BAARF), mainly have small files (smaller than your stripe size) or store extremely sensitive data, you shouldn’t use RAID5. Some even say you shouldn’t use RAID5 at all. If that concerns you, use RAID1 or RAID10 to be safe.

Let’s start. I used an old desktop PC of a friend that wasn’t in use anyway. Make sure the machine has a decent power supply, maybe even an oversized one, to serve all the HDDs in the array plus disks to come if you want to grow your array in the future. Keep in mind that the power consumption of a drive peaks at spin-up and can be much higher than its normal consumption. Luckily, my ATA controller is smart enough to spin up one drive after the other, so I don’t have that peak. The PC should also have enough computing power to handle your disk array (remember, software RAID taxes the CPU) and to still execute rsync commands. It is also a good idea to check whether the PC you have in mind boots without a graphics card. I removed mine right after I had set up the BIOS and installed basic Arch Linux (which actually could also be done over SSH) and did the rest over SSH. That way your server stays cooler, quieter and consumes less power. Make sure it also has Gigabit Ethernet, otherwise it is a pain to copy anything bigger than a TiB. One really important thing you should look out for is enough S-ATA/P-ATA ports for your disks! If you can, make sure they all use separate buses. If some of them share the same data bus, it can lead to data loss when a bus is malfunctioning (basic RAID configurations only survive one HDD failure at a time until reconstruction)! Of course, you could normally add missing Gigabit Ethernet and hard drive ports via PCI, but depending on your board they sometimes share the same bus(es) with the ports on the board.

Make sure to have a good understanding of (software) RAID and its dangers in combination with your hardware before trusting it with all your sensitive data! The Wikipedia entries for RAID and mdadm provide lots of information to start with. Consult the man pages for further reading. Again, I’m no expert, but at least (I think) I know what I’m doing. ;-)

Next, I ordered five hard disks with 1.5 TB each (the biggest back then), which gave me

a = s * (n – 1) = 1.5 TB * (5 – 1) = 6 TB ≈ 5.4 TiB.

a – available storage of array; s – capacity of smallest drive; n – number of drives

In other words: the space of one drive is “lost” (= used for parity). You can also see that it is best to use drives of similar size, otherwise you waste the size difference between any disk and the smallest disk. That is acceptable if you use old hard drives which you eventually plan to replace one by one over time. Since I ordered new ones, I bought five of the same size/model. If you are paranoid, you can choose drives of different vendors/models, or choose drive models you have already tested in the past, to reduce the danger of multiple disk failures at the same time. I chose a model with only 5400 min^-1: it’s a little quieter, emits less heat and consumes less power. Of course, this comes with a speed disadvantage on read/write access, but RAID5 itself (in theory) gives you a read/write benefit factor of n – 1 (with n – number of drives) for full stripe writes and with a powerful CPU. A five-disk RAID5 array has a factor of 4, thus 80% write efficiency at best, compared to a RAID0 with the same number of drives. In reality this only applies to read operations. Writing is more complicated, because one write operation requires several read, parity-calculation and write cycles. Well, performance will be poor anyway, since the caches are disabled. It is still absolutely sufficient for my purpose: an external storage as a NAS alternative. However, today one would get a 6 TB RAID5 with only three 3 TB disks (the minimum for RAID5) — or a 12 TB RAID5 with the same number of drives (5) in the array. As long as there are no drives big enough to make RAID5 needless for me, I will replace each failed drive with a bigger one, either increasing space or shrinking the array if I don’t need more storage. Fewer disks reduce the power consumption and, more importantly, the risk of data loss. The more disks you have, the higher the possibility of two drives failing at the same time (RAID5 can only cope with one drive failure at a time). For performance reasons it’s best to choose an odd number of drives: an even number for data plus one for parity.

If you know the failure rate of your chosen hard drive model, you can estimate the array failure rate by

f = n * (n – 1) * r^2

f – array failure rate; n – number of drives; r – failure rate of a single drive

with assumption of identical failure rates for each drive.
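
For example (purely illustrative numbers, not taken from any datasheet): with n = 5 drives and an assumed per-drive failure rate of r = 0.05 over the relevant period, f = 5 * 4 * 0.05^2 = 0.05, i.e. about 5 %.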

This estimation is rather pointless, however, since you rarely get reliable information about failure rates from the manufacturers anyway. And if you do, it probably doesn’t reflect real-life conditions. Fact is, every hard drive will fail eventually — that’s why we use RAID in the first place! You have to be prepared for a disk failure anyway, no matter whether the vendor advertises a low or high failure rate. If you still want to know more about it, Google’s field analysis might be useful.

Gathering the information and planning the actual setup was the hard part; the rest was easy. I took the lazy way out and first installed an old small hard drive, then installed Arch Linux on it and set up SSH. Next, I installed the five array disks and configured the BIOS, disabling any option I didn’t need to save power and resources. Finally, I removed any hardware that wasn’t needed anymore (including the DVD drive and graphics card).

To prevent data corruption from power loss, I disabled the write cache of each drive in the array:
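
For example, assuming the array disks are /dev/sdb through /dev/sdf (adjust the device names to your system):

    # for d in /dev/sd[b-f]; do hdparm -W 0 "$d"; done

Note that on many drives this setting does not survive a power cycle, so it may also belong in a boot script.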

Then I partitioned the first drive:
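
A sketch with an assumed device name:

    # fdisk /dev/sdb

Inside fdisk, create a single primary partition with the type and the unallocated space described below, then write the table with “w”.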

Choose “Non-FS data” as the partition type (to make sure no autodetection tries to assemble the array if you boot a live distribution or install the array in a different system) and leave ~100 to 200 MB at the end unallocated. Of course you could also build your array right on top of the unformatted disks to simplify things, but then you always have to make sure that every disk you replace is at least as big as the smallest disk in the array. Hard drives of the exact same model can still have tiny size differences; if your replacement drive is smaller, the RAID array cannot be resynced. That is why it’s better to build a RAID on top of partitioned drives with a little free space at the end.

Now, copy the partition table to all other disks in the array:
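
One way is sfdisk, which dumps the table from the first disk and writes it to the next (device names assumed):

    $ sudo sfdisk -d /dev/sdb | sudo sfdisk /dev/sdc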

After repeating the command for every remaining disk of the array, you end up with identically partitioned disks (verify with $ sudo fdisk -l). Now build the array:
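
For example (the device names, array name, metadata version and chunk size are only illustrative; the options are explained below):

    # mdadm --create --verbose /dev/md0 --level=5 --metadata=1.2 --chunk=512 --raid-devices=5 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1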

mdadm – software RAID implementation of GNU/Linux; --create – creates the array; --verbose – tell what’s happening; --level= – RAID level to create; --metadata= – version of metadata to be used; --chunk= – chunk size in KiB; --raid-devices= – number of disks in array

The array is immediately created under the virtual device /dev/your_array, assembled and ready to use (in degraded mode). You can directly start using it while mdadm resyncs the array in the background. It can take a long time to restore parity; you can check the progress with:
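
    $ cat /proc/mdstat

or, to have the output refresh automatically:

    $ watch cat /proc/mdstat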

Don’t wait for it to finish, just go on with the setup, but keep in mind that operations are slow due to the reconstruction of parity. Save your RAID configuration to /etc/mdadm.conf:
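
mdadm can generate the entry itself, for example:

    # mdadm --detail --scan >> /etc/mdadm.conf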

This step isn’t actually needed anymore; it’s more of an old habit. It adds basic information to the mdadm configuration file. Old versions of mdadm would read the information about the array from that file on assembly, which is deprecated. It may still provide useful information about an array that isn’t running.

The array can be formatted just like any other disk drive (you could even partition it), but there are a few things to keep in mind:

  • Due to the big size, not all filesystems are suited,
  • shrinking/growing of the filesystem size online should be supported,
  • do the “RAID math” beforehand!

The math for chunk/block/stride/stripe size goes like this:

Choose a chunk size for your RAID array and a block size for your filesystem.

stride size = chunk size / block size

stripe width = stride size * number of data-bearing disks
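
For example (values assumed): with a chunk size of 512 KiB and an ext4 block size of 4 KiB, stride size = 512 KiB / 4 KiB = 128; with five drives, four of them carry data, so stripe width = 128 * 4 = 512.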

To have multiple partitions on the array, it might be a good idea to use LVM. That complicates things and I only wanted one big ext4 FS anyway:
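
For example, with the stride and stripe-width numbers from the calculation above (block size and array name are assumptions; the options are explained below):

    # mke2fs -t ext4 -v -m .1 -b 4096 -E stride=128,stripe-width=512 /dev/md0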

mke2fs -t ext4 – creates an ext4 FS; -v – verbose; -m .1 – prevents non-privileged processes from filling up the FS completely; -b – block size in bytes; -E – extended options for stride and stripe-width

After formatting, mount the filesystem…
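
For example (mount point assumed):

    # mkdir -p /mnt/raid
    # mount /dev/md0 /mnt/raid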

… and copy your data to it from any client over ssh:
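
For example (user, host name, paths and mount point are assumptions):

    $ rsync -av --progress /path/to/your/data/ user@server:/mnt/raid/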

Or locally on the server:
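
Again with assumed paths:

    $ rsync -av --progress /path/to/your/data/ /mnt/raid/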

To get files back from the server to your client:
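
For example (names assumed again):

    $ rsync -av --progress user@server:/mnt/raid/some_directory/ /path/to/local/destination/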

After a reboot the array isn’t automatically reassembled. Unless you want to do that by hand every time, write it to your rc.local or equivalent:
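
A minimal line for rc.local could look like this (device names assumed; with a matching ARRAY entry in /etc/mdadm.conf, mdadm --assemble --scan works as well):

    mdadm --assemble /dev/md0 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1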

When the resync process of mdadm finishes, you are done! If restoring parity takes a long time (I remember mine took over half a day), you can do a # shutdown -h now at any time. Mdadm stops the reconstruction and resumes it after the next boot.

You can always check the status of your RAID array with $ cat /proc/mdstat. It should then display information about all running arrays and at the end something like [UUUUU]. Every U stands for an “Up” disk in the array. If you see a _ (underscore) within the brackets, it stands for a failed disk, thus a degraded array. You will see this for example if you check your array during the initial resync process ([UUUU_]). That is normal — there is no redundancy until the reconstruction finishes.

Replacing a failed drive

When $ cat /proc/mdstat reports a failed disk, check the details:
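
For example (array name assumed):

    $ sudo mdadm --detail /dev/md0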

Mark disk as faulty and remove it from the array:
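
Assuming /dev/sdd1 is the failed member of /dev/md0:

    # mdadm /dev/md0 --fail /dev/sdd1
    # mdadm /dev/md0 --remove /dev/sdd1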

If the faulty disk is still accessible to some degree, erase all data on it:
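
One way to do this is shred (device name assumed); one random pass plus a final pass of zeros matches the note below:

    # shred -v -n 1 -z /dev/sdd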

One iteration with a final write of zeros should be sufficient (I know there are a lot of theories that suggest otherwise; do as many iterations as you like).

Replace the disk (with one of similar or bigger size than the smallest disk in the array) and copy the partition table to it just like you did during the initial setup of the array (see above). Finally, add the new partition to the array:
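
For example (device names assumed):

    # mdadm /dev/md0 --add /dev/sdd1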

Reconstruction starts immediately, check $ cat /proc/mdstat to see when it has finished.

Notes

If you want to install an OS on the RAID array, things work differently and the above setup will not work (for example: you need to install the bootloader on all drives, set the bootable flag during partitioning, insert hooks into your initial ramdisk or build the respective modules into the kernel, /boot cannot be on a RAID5 array, etc.).

Also, if you don’t like using rsync (a great program for this job!) for every data transfer, consider installing Samba.

This probably isn’t the most elegant way to set up a RAID storage server. As I already said, I’m not an expert in storage solutions. Educate yourself and modify the above setup to your needs before trusting your data to my lousy RAID configuration. Although all my RAIDs have worked seamlessly to the present day, I don’t guarantee anything written in here.

Update: I found another guide that has some good additional information about the hardware, especially the ECC memory part.
