OpenZFS 2.1 is out: let's talk about its brand-new dRAID vdevs

Friday afternoon, the OpenZFS project released version 2.1.0 of our beloved "it's complicated, but it's worth it" filesystem. The new release is compatible with FreeBSD 12.2-RELEASE and up, and with Linux kernels 3.10 through 5.13. It includes several general performance improvements, along with a few entirely new features aimed mostly at enterprise users and other highly advanced use cases.

Today we will focus on what is arguably the most important feature of OpenZFS 2.1.0: the dRAID vdev topology. dRAID has been under active development since at least 2015 and reached beta status when it was merged into OpenZFS master in November 2020. It has been tested extensively in several major OpenZFS development shops since then, which means today's release is "new" in production status, not "new" as in untested.

Overview of Distributed RAID (dRAID)

If you thought ZFS topology was already complicated, get ready. Distributed RAID (dRAID) is an entirely new vdev topology, first presented at the 2016 OpenZFS Dev Summit.

When creating a dRAID vdev, the administrator specifies a number of data, parity, and hotspare sectors per stripe. These numbers are independent of the number of actual disks in the vdev, as the following example from the dRAID Foundations documentation illustrates:

[zpool status listing: a single dRAID vdev built from eleven disks, wwn-0 through wwn-A, all ONLINE with zero read/write/checksum errors]

In the example above, we have eleven disks, wwn-0 through wwn-A. We created a single dRAID vdev with 2 parity devices, 4 data devices, and 1 spare per stripe; in condensed jargon, a draid2:4:1.

Although there are eleven disks in total in the draid2:4:1, only six of them are used in each data stripe, and one in each physical stripe. In a world of perfect vacuums, frictionless surfaces, and spherical chickens, the on-disk layout of a draid2:4:1 would look like this:

[diagram: eleven columns of sectors, 0 through A, with the spare (s), parity, and data sectors rotating position from one stripe to the next]

Effectively, dRAID takes the concept of "diagonal parity" RAID one step further. RAID5 wasn't the first parity RAID topology; that was RAID3, in which parity lived on a fixed drive rather than being distributed throughout the array. RAID5 did away with the fixed parity drive and distributed parity across all of the array's disks instead, which allowed significantly faster random writes than RAID3.

dRAID extends this concept of spreading the load across all disks, rather than concentrating it on one or two fixed disks, to the spares as well. If a disk fails in a dRAID vdev, the parity and data sectors that lived on the dead disk are copied to the reserved spare sector(s) in each affected stripe.

Let's take the simplified diagram above and examine what happens when we fail a disk out of the array:

[diagram: the same layout with one disk's column failed, leaving degraded stripes]

We then resilver onto the spare capacity that was previously held in reserve:

[diagram: the failed disk's data and parity sectors rewritten into the spare sectors of each affected stripe]

Note that these diagrams are simplified. The full picture involves groups, slices, and rows, which we won't attempt to get into here. The logical layout is also randomly permuted to distribute things over the drives more evenly based on offset.
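To make the topology above concrete, here is a minimal sketch of how such a vdev might be created from the command line, assuming the draid[parity][:datad][:childrenc][:sparess] syntax documented in OpenZFS's zpoolconcepts man page. The pool name "tank" and the shorthand device names are placeholders, not taken from the original example.

```sh
# Hypothetical sketch: build the draid2:4:1 layout described above from
# eleven disks. "tank" and the wwn-* names are placeholders.
zpool create tank draid2:4d:11c:1s \
    wwn-0 wwn-1 wwn-2 wwn-3 wwn-4 wwn-5 \
    wwn-6 wwn-7 wwn-8 wwn-9 wwn-A

# Inspect the result; the vdev should appear under a name like
# draid2:4d:11c:1s-0, with the distributed spare listed separately.
zpool status tank
```

Note that the "children" count (11c here) must match the number of disks handed to the vdev, while the data, parity, and spare counts describe the per-stripe layout rather than physical devices.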
If you're interested in the gritty details, you can refer to the detailed comment in the original code commit.

It's also important to note that dRAID does not support the dynamic stripe widths offered by traditional RAIDz1 and RAIDz2 vdevs; its stripes are fixed-length. Using 4kn disks, a draid2:4:1 vdev like the one shown above will consume 24KiB on disk for every metadata block, where a traditional six-wide RAIDz2 vdev would need only 12KiB. This discrepancy gets worse as the values of d+p increase: a draid2:8:1 would need 40KiB to store the same metadata block.

For this reason, the special allocation vdev is very useful in pools with dRAID vdevs. When a pool with a draid2:8:1 and a three-wide special needs to store a 4KiB metadata block, it does so in only 12KiB on the special, instead of the 40KiB it would take on the draid2:8:1.

dRAID performance, fault tolerance, and recovery

For the most part, a dRAID vdev will perform similarly to an equivalent group of traditional vdevs. For example, a nine-disk draid1:2:0 will perform nearly identically to a pool of three 3-wide RAIDz1 vdevs. Fault tolerance is similar as well: with p=1 you are guaranteed to survive a single failure, just as you are with RAIDz1 vdevs.

Note that we said fault tolerance is similar, not identical. A traditional pool of three 3-wide RAIDz1 vdevs is only guaranteed to survive a single failure, but it will probably survive a second; as long as the second disk to fail isn't part of the same vdev as the first, everything is fine.

In a nine-disk draid1:2:0, a second disk failure before resilvering completes will almost certainly kill the vdev, and the pool along with it. Because there are no fixed groups of stripes, a second failure is very likely to knock out additional sectors in stripes that are already degraded, regardless of which disk fails second.

This slightly decreased fault tolerance is compensated for by drastically faster resilver times. The chart at the top shows that resilvering onto a traditional, fixed spare takes roughly thirty hours no matter how the dRAID vdev is configured, while resilvering onto distributed spare capacity can take less than one hour.

This is largely because resilvering onto a distributed spare splits the write load among all surviving disks. When resilvering onto a traditional spare, the spare disk itself is the bottleneck: reads come from all the disks in the vdev, but all writes must be completed by the spare. When resilvering onto distributed spare capacity, both the read and the write workloads are split among all surviving disks.

The distributed resilver can also be a sequential resilver rather than a healing resilver, meaning ZFS can simply copy over all the affected sectors without worrying about which blocks those sectors belong to. Healing resilvers, by contrast, must scan the entire block tree, which results in a random read workload rather than a sequential one.

When a physical replacement for the failed disk is eventually added to the pool, that resilver operation will be a healing resilver, not a sequential one, and it will bottleneck on the write performance of the single replacement disk rather than of the entire vdev. But the time it takes to complete that operation matters far less, since the vdev is no longer in a degraded state to begin with.

Conclusions

Distributed RAID vdevs are mostly intended for large storage servers; OpenZFS draid design and testing revolved largely around 90-disk systems. At smaller scale, traditional vdevs, spares, and the rest of the familiar toolkit remain as useful as they ever were.

We particularly caution storage newbies to be careful with draid: it's a significantly more complex layout than a pool built from traditional vdevs.
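A brief sketch of the two-stage recovery flow described above may help, assuming the distributed-spare naming convention (draid[parity]-[vdev index]-[spare index]) used in the OpenZFS dRAID documentation; the pool name, failed disk, and replacement disk names here are hypothetical.

```sh
# Hypothetical recovery flow for a dRAID pool named "tank" in which
# wwn-3 has failed; all names are placeholders.

# Stage 1: rebuild the failed disk's contents onto the distributed spare.
# This is the fast, sequential rebuild, with reads and writes spread
# across all surviving disks in the vdev.
zpool replace tank wwn-3 draid2-0-0

# Stage 2: once a physical replacement (here called wwn-B) is installed,
# replace the failed disk with it. This triggers the slower healing
# resilver, after which the distributed spare returns to reserve.
zpool replace tank wwn-3 wwn-B

# Monitor progress of either stage.
zpool status -v tank
```

The key point is that only the first stage happens while the vdev is degraded, so the slower, single-disk-bound healing resilver in stage 2 no longer puts the pool's redundancy at risk.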
The rapid resilvering can be amazing, but draid takes a hit in both compression and some performance scenarios because of its necessarily fixed-length stripes.
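As a closing illustration of the mitigation mentioned earlier for that fixed-stripe overhead, here is a hedged sketch of adding a special allocation vdev to a dRAID pool; the pool name and SSD device names are placeholders.

```sh
# Hypothetical: add a three-wide mirrored special allocation vdev so that
# metadata blocks land there instead of consuming a full fixed-width
# dRAID stripe.
zpool add tank special mirror ssd-0 ssd-1 ssd-2

# Optionally send small file blocks (here, 4KiB and under) to the special
# vdev as well, via the per-dataset special_small_blocks property.
zfs set special_small_blocks=4K tank
```

The trade-off is that the special vdev becomes critical to the pool, so it should be at least as redundant as the dRAID vdev it supplements.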