Saturday
Feb 15, 2014

My case for BTRFS over ZFS

The computer industry always finds itself mired in FUD and heated debate over what can only be described as religion. The ZFS and BTRFS camps are clearly no different, so I figured I'd give some actual real-world experience from the perspective of a Systems Engineer who has used both with great success.

 

First, let's address this site point-by-point. My impression is that it comes from a person with very little understanding of BTRFS, or of the fact that it is a filesystem under active development whereas ZFS is stable and mature. That should be anybody's first bullet point, BTW. ZFS has been deployed in production environments for a long while with great success, while BTRFS is only now gaining real traction.

 

ZFS organizes file systems as a flexible tree

The complaint here seems to be that a snapshot of a subvolume is logically "located" beneath that volume in the filesystem tree. This is probably more a consequence of design (b-tree and all) than anything else. But, since a subvolume is still a volume, the administrator can mount it anywhere they like, including at a mountpoint that is not beneath the root node of BTRFS.

 

File system operations in ZFS can apply recursively

This is an intentional design decision and calling it out as a short-coming of BTRFS actually shows a lack of understanding of the filesystem's design principles and system administration in general. Typically, you want a command to operate on as small a set of data as possible, and the administrator handles the recursion with something like find or for. A matter of opinion, perhaps, but historical convention argues that ZFS does it wrong in this case.

The actual design here is that BTRFS's subvolumes are actual b-trees on their own. So, an operation that accesses root/vol1 would need to access an entirely different node to apply to root/vol2. When using BTRFS, it is best to think of subvols as actual separate filesystems. They are stored in the same pool, but they are logically separated from other subvolumes.
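To make the "administrator handles the recursion" idea concrete, here is a minimal Python sketch. It parses a listing in the shape printed by "btrfs subvolume list" and builds one command per subvolume. The sample listing, the /mnt/btrfs mountpoint, and the helper names are made up for illustration, and the commands are only built, never executed.

```python
# Hypothetical listing in the shape printed by "btrfs subvolume list <mountpoint>".
SAMPLE_LISTING = """\
ID 256 gen 12 top level 5 path vol1
ID 257 gen 15 top level 5 path vol2
ID 258 gen 15 top level 257 path vol2/nested
"""

def subvolume_paths(listing):
    """Return the subvolume paths named in the listing, in order."""
    paths = []
    for line in listing.splitlines():
        # Everything after the first " path " marker is the subvolume path.
        _, marker, path = line.partition(" path ")
        if marker:
            paths.append(path)
    return paths

def per_subvolume_commands(mount_root, listing, action):
    """Apply one action to every subvolume: the recursion lives here, not in btrfs."""
    return [action + [f"{mount_root}/{path}"] for path in subvolume_paths(listing)]

# Build (but do not run) "btrfs subvolume show" for each subvolume.
for cmd in per_subvolume_commands("/mnt/btrfs", SAMPLE_LISTING,
                                  ["btrfs", "subvolume", "show"]):
    print(" ".join(cmd))
```

The same loop works for any per-subvolume action, which is the whole point: the tool stays small and the administrator decides how far the recursion goes.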

 

Policy set on ZFS file systems is inherited by their children

Again, because of the fact that subvolumes are trees unto themselves, this stands to reason. It's a consequence of the design. However, it doesn't mean it's not possible. For instance, options such as compression are applied to subvolumes when the parent volume is mounted.

Of course, with an IOCTL these options can be enabled or disabled on a per-item basis, so it really isn't a point worth considering.

 

ZFS auto-mounts file systems by default

This is actually a lack of understanding by the author. There is a conf file that informs ZFS of the ZFS filesystems that exist, and that they should be mounted when the first ZFS filesystem is mounted. To be clear, ZFS has actually replaced the behavior of /etc/fstab with its own config file. Even worse, the administrator can still use the fstab config file! This is fundamentally broken, IMO.

BTRFS, and all other filesystems with reasonable behavior, require an administrator to mount the filesystem. Of course, with an automount daemon, I'm sure it would be extremely trivial for any administrator to replicate this behavior.

Note, however, if a child subvol is created within a BTRFS subvol it will appear in the mounted filesystem. In other words, if I have a btrfs volume "/mnt/btrfs" and I create subvol "/mnt/btrfs/sub", when I mount /mnt/btrfs the subvolume /mnt/btrfs/sub appears in the tree.

The concern about changing mountpoints of subvols is only half true, since a subvol can be mounted to any location by passing the subvol= option to mount.
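As a rough sketch of the subvol= option, the Python helper below only assembles the mount command line and prints it; nothing is mounted. The device /dev/sdb1 and the target /srv/sub are hypothetical, and the function name is mine.

```python
import shlex

def mount_subvol_command(device, subvol, mountpoint):
    """Build (but do not run) a mount command for a single BTRFS subvolume."""
    # subvol= names the subvolume by its path inside the BTRFS volume;
    # the mountpoint can be any directory, not just one under the volume root.
    return ["mount", "-t", "btrfs", "-o", f"subvol={subvol}", device, mountpoint]

cmd = mount_subvol_command("/dev/sdb1", "sub", "/srv/sub")
print(shlex.join(cmd))
```

The same option can go in the options column of an /etc/fstab entry, so the administrator still controls every mount from one place.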

 

ZFS tracks used space per file system

This is actually quite confusing at first. ZFS shows you quite clearly how much space is being consumed by each mounted ZFS filesystem. BTRFS shows how much space is consumed in the entire pool. The du command works as expected, whereas df shows the raw free space. This can make life difficult for the administrator, because they must understand the semantics of their data protection strategy to calculate how much space can actually be allocated. There is a patch for this, but it's a serious pain for newcomers to BTRFS. This is a clear usability win for ZFS.
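The arithmetic the administrator ends up doing by hand looks roughly like this. It is a simplified sketch: it assumes one data profile for the whole pool and equally sized devices, and it ignores metadata overhead. The table and function name are my own.

```python
# Number of copies of each data block kept by some common BTRFS profiles.
COPIES = {"single": 1, "dup": 2, "raid1": 2, "raid10": 2}

def allocatable_bytes(raw_free_bytes, profile):
    """Estimate how much new data fits, given the raw free space df reports."""
    # Every block is written COPIES[profile] times, so usable space shrinks by that factor.
    return raw_free_bytes // COPIES[profile]

raw_free = 4 * 10**12  # 4 TB of raw free space across the pool (made-up figure)
for profile in ("single", "raid1", "raid10"):
    print(profile, allocatable_bytes(raw_free, profile))
```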

Because each item in BTRFS can have a different set of options applied to it, though, it does start to make a small bit of sense. Any file can (in the future) have an IOCTL call set its data protection and compression options, so predicting the amount of free space that is actually available for allocation given that detail would be very difficult.

This is also an interesting design question with room for discussion. Since storage in ZFS and BTRFS is actually a pool, should the administrator see the pool's view of storage, or the individual volume's view by default? Clearly, there should be an option to view each, but what should the default view be? I believe ZFS does it properly by showing the volume's values.

 

ZFS distinguishes snapshots from file systems

This seems to be an issue with the author's experience with BTRFS. A snapshot does not have to be a peer with its original. Snapshots, like any subvolume, can be given a destination. Personally, I like to create a subvolume that will act as a target for snapshots and create snapshots with a destination of that particular volume. In other words, I will have subvolume "/mnt/btrfs/snapshots". I will then create a snapshot of "/mnt/btrfs/somevol" to "/mnt/btrfs/snapshots/somevol". In reality, I like to prepend or append the Unix timestamp of when the snapshot was taken as well, but the point is snapshots can be created in some other volume if the administrator chooses.
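A small Python sketch of that naming scheme follows. It builds the "btrfs subvolume snapshot" command with the Unix timestamp appended to the snapshot name, and does not run it. The helper name and the "name-timestamp" separator are just one way to do it.

```python
import time

def snapshot_command(source, snapshot_root, name, now=None):
    """Build (but do not run) a snapshot command with a timestamped destination."""
    # Use the current Unix time unless a fixed value is passed in (handy for testing).
    stamp = int(time.time() if now is None else now)
    destination = f"{snapshot_root}/{name}-{stamp}"
    return ["btrfs", "subvolume", "snapshot", source, destination]

print(" ".join(snapshot_command("/mnt/btrfs/somevol", "/mnt/btrfs/snapshots", "somevol")))
```

Because the destination lives under the dedicated snapshots subvolume, the snapshots never clutter the directory that holds the original.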

 

ZFS lets you specify compression and other properties per file system subtree

BTRFS allows these options to be specified on a per-file basis.

 

ZFS is more stable

This is true, as long as your definition of stable relates to deployment in production systems and code age. Since my definition of stable is exactly that, I agree. BTRFS is much newer, and because of this and the existence of the time continuum, ZFS is more mature. However, we should all agree that sometimes new products get to benefit from the discoveries made over time. New is often better than old. This is certainly true in computers.

Given the same time that ZFS was granted, I have absolutely no doubt that BTRFS will be as stable and mature if not more. Because BTRFS was developed as an open source project from day one, it has the anecdotal advantage of more contributors and testers than Sun allowed when creating ZFS. I say anecdotal because there's no hard data that I would be able to dig up to prove this one way or the other. Sun created ZFS behind closed doors, so its extremely early history is pretty much lost.

 

ZFS has RAIDZ

Nobody should be using RAID-Z. Period. As a user of ZFS, this was the very first item drilled into my head. Using RAID-Z is effectively equivalent to using RAID-5. RAID-Z2 is the rough equivalent of RAID-6, and there is even RAID-Z3. However, every single data reliability study ever done has noted that using duplicates provides far superior protection. Since BTRFS and ZFS advertise their main selling points as data reliability, why on earth would you ever choose the cheap-and-dirty way out?!

Even more important is performance. Using RAID-5 or RAID-6 is far slower than using RAID-10. This is true when using RAID-Z instead of a RAID-10 style protection scheme in ZFS, or when using RAID-10 protection instead of RAID-5 protection in BTRFS. By the way, RAID-5 and RAID-6 are available in BTRFS. Their existence was intended from the early development days, but it simply wasn't a real development priority for BTRFS since most consumers are using RAID-10 anyway. 

PLEASE don't use RAID-Z or RAID-5 for data protection.

 

ZFS has send and receive

BTRFS has had send and receive for quite some time at this point. It was one of the first "extra" features added, and it certainly existed before BTRFS was considered a stable format.

As far as ease of use, "btrfs send" is pretty equivalent to "zfs send".
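For comparison, a basic BTRFS transfer is a two-command pipeline. The sketch below only builds that pipeline as a shell string. The snapshot path and the /mnt/backup target are hypothetical, and note that "btrfs send" expects a read-only snapshot as its source.

```python
import shlex

def send_receive_pipeline(snapshot, receive_dir):
    """Build (but do not run) a local 'btrfs send | btrfs receive' pipeline."""
    # send streams the (read-only) snapshot to stdout; receive recreates it under receive_dir.
    send = ["btrfs", "send", snapshot]
    receive = ["btrfs", "receive", receive_dir]
    return f"{shlex.join(send)} | {shlex.join(receive)}"

print(send_receive_pipeline("/mnt/btrfs/snapshots/somevol", "/mnt/backup"))
```

Swapping the receiving half for an ssh invocation sends the stream to another host, much as one would with "zfs send".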

 

ZFS is better documented

This is very subjective. Much of the great ZFS documentation has almost entirely disappeared after the Oracle acquisition. Also, much of it is very old. In contrast, the "btrfs" command is very well documented, much like the "zfs" and "zpool" commands are, with man pages and fantastic "--help" results.

On the other hand, technical documentation for BTRFS abounds. The Wikipedia page details its design very well, and the BTRFS page at kernel.org is a wonderful central repository of information for users and administrators alike. Again, this is subjective because as a systems engineer I am interested in details that most users and some administrators are not interested in. With regard to the information easily available to an administrator, I find no appreciable difference.

 

ZFS uses atomic writes and barriers

It is important to understand that the author is incorrect about barriers and atomic filesystem transactions, but correct that ZFS uses atomic writes and write barriers. Barriers do not mean data will not be lost or that I/O transactions are atomic. Instead, a barrier is a way to ensure, as the author noted, the order of writes reaching durable media. By ensuring the order of certain writes, you can provide some guarantees about data durability.

However, the author makes a common incorrect claim: "you will never lose a single byte of anything committed to the disk". This is absolutely incorrect. Instead, the claim made by journaled, barrier-write enabled filesystems is that you will not lose anything written before the most recently committed barrier-written transaction. If you yank power to a host that is writing data to durable media, you absolutely will lose data. The guarantee here is that you will only lose certain data, and should never be left in a position where a filesystem is corrupted beyond repair.
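A toy model makes the guarantee easier to see. The Python sketch below replays an ordered log of writes and commit markers and keeps only what was committed before the "crash". It illustrates the ordering guarantee only, and is not how ZFS or BTRFS actually lay out data on disk.

```python
def recover(log):
    """Return the state that survives a crash, given the log that reached the disk.

    Each entry is either ("write", key, value) or ("commit",).
    """
    durable = {}   # state guaranteed to survive
    pending = {}   # writes made since the last commit
    for entry in log:
        if entry[0] == "write":
            _, key, value = entry
            pending[key] = value
        elif entry[0] == "commit":
            # The barrier guarantees everything before the commit is on disk first.
            durable.update(pending)
            pending.clear()
    # Power is lost here: whatever is still pending is gone.
    return durable

log = [("write", "a", 1), ("commit",), ("write", "b", 2), ("write", "a", 3)]
print(recover(log))  # {'a': 1}
```

The writes of b and the newer a are lost, but the filesystem comes back in a consistent state: exactly what was committed, nothing half-written.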

Understanding these facts is critical to a system administrator, and I have seen this incorrect statement far too often.

With regard to BTRFS, it also uses write barriers to make guarantees about its data and metadata. I'm not sure why the author was convinced that it didn't, but that's the power of FUD I suppose.

 

ZFS will actually tell you what went bad in no uncertain terms, and help you fix it

It seems the author had no idea that "btrfs device stats" existed. The output of that command, when pointed to a btrfs mountpoint, is a list of devices that compose the pool and counters for several kinds of errors and corruption types.
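As a sketch of how those counters can be consumed, the Python below parses text in the shape that "btrfs device stats" prints and reports which devices have non-zero counters. The sample output and device names are invented for the example.

```python
import re

# Invented sample in the shape printed by "btrfs device stats <mountpoint>".
SAMPLE_STATS = """\
[/dev/sda].write_io_errs   0
[/dev/sda].read_io_errs    0
[/dev/sda].flush_io_errs   0
[/dev/sda].corruption_errs 0
[/dev/sda].generation_errs 0
[/dev/sdb].write_io_errs   0
[/dev/sdb].read_io_errs    0
[/dev/sdb].flush_io_errs   0
[/dev/sdb].corruption_errs 2
[/dev/sdb].generation_errs 0
"""

STAT_LINE = re.compile(r"^\[(?P<dev>[^\]]+)\]\.(?P<counter>\w+)\s+(?P<value>\d+)$")

def parse_device_stats(text):
    """Return {device: {counter_name: value}} from device-stats style text."""
    stats = {}
    for line in text.splitlines():
        match = STAT_LINE.match(line.strip())
        if match:
            stats.setdefault(match["dev"], {})[match["counter"]] = int(match["value"])
    return stats

def devices_with_errors(stats):
    """List the devices that have at least one non-zero counter."""
    return sorted(dev for dev, counters in stats.items() if any(counters.values()))

print(devices_with_errors(parse_device_stats(SAMPLE_STATS)))
```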

Furthermore, "btrfs scrub status" will give you the results of the last scrub operation. This is very similar to ZFS, so I'm not sure why there is any confusion. zpool does give some interesting stats about the pool being interrogated, but I personally don't need I/O stats re-implemented in a filesystem-specific way. On Linux, I have /proc and /sys. Those two tools alone replace much of what a system engineer needs from the zfs and zpool commands, so this is a wash to me or a win for BTRFS.

 

ZFS increases random read performance with advanced memory and disk caches

No, it does not. I've used ZFS in several scenarios that required differing I/O patterns to behave differently. I've used ARC, L2ARC on SSD. What I can say, without question is the following:

ARC is a terrible replacement for the VFS caching layer in Linux, because VFS has no idea what ARC is. More importantly, RAM allocated to ARC is memory that is considered actively used, whereas almost all other filesystem caches will purge less recently used data when the operating system requires the memory for some other operation. The latter is the better design, because I'd rather the OOM killer not come by and kill a critical task when I have several gigabytes of data in RAM that could be re-read from disk. It's pretty rare that a NAS/SAN would require a massive, active working set in RAM, so I find the VFS caching layer to be adequate as it is now.

L2ARC is, to me, the recognition that adding faster disks (SSD) is easier than adding RAM to a host. L2ARC caches more frequently used blocks on a fast bit of storage so that when those blocks are requested again and aren't cached in ARC, they will be read from L2ARC faster than the backing spindles could produce them. I've had very good experience with FusionIO cards used as L2ARC. Linux has an answer for this in recent kernels, spurred by the BTRFS developers' request that a caching block device be made available. The benefit on Linux is that instead of this being a filesystem specific addition, the OS itself will add this feature and all filesystems will benefit!

In practice I have found that L2ARC only helps a bit, because random access typically happens...wait for it...randomly! Unless it happens randomly against the same blocks, caching the data is a useless exercise that adds more I/O to the underlying pool of spinners. I can only imagine this kind of thing is added when a developer realizes their filesystem offers terrible performance for database loads, and they incorrectly presume they can solve the problem by adding an additional caching layer.
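A quick simulation shows why. The Python sketch below models a plain LRU cache (not ZFS's actual ARC/L2ARC algorithm) under uniformly random reads. The block counts are arbitrary, and the seed is fixed so the run is repeatable.

```python
import random
from collections import OrderedDict

def lru_hit_rate(total_blocks, cache_blocks, reads, seed=0):
    """Fraction of reads served from an LRU cache under uniform random access."""
    rng = random.Random(seed)
    cache = OrderedDict()  # keys are block numbers, ordered oldest to newest
    hits = 0
    for _ in range(reads):
        block = rng.randrange(total_blocks)
        if block in cache:
            hits += 1
            cache.move_to_end(block)       # mark as most recently used
        else:
            cache[block] = True
            if len(cache) > cache_blocks:
                cache.popitem(last=False)  # evict the least recently used block
    return hits / reads

# Cache holds 10% of the data set: the hit rate lands near 10%.
print(f"{lru_hit_rate(10_000, 1_000, 200_000):.2f}")
# Working set fits in the cache: nearly every read is a hit.
print(f"{lru_hit_rate(500, 1_000, 50_000):.2f}")
```

With truly random access the hit rate is roughly the cache size divided by the data set size, so a cache only pays off when the reads keep landing on the same blocks.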

 

ZFS increases random and synchronous write performance with log devices

No. It does not. The ZIL offers very little in the way of performance enhancements. It is only applicable to sync writes, which are uncommon in the real world of NAS/SAN servers. Many people presume the ZIL automagically combines random writes into larger sequential writes. Again, this is only true when the writes are synchronous and happen within the flush barrier period (default of 10 seconds).

There is an answer to this in Linux: the block I/O scheduler does some basic write combining before flushing I/O to the block devices below. Again, the Linux approach is different in that if an idea is good, it should be applied to the system as a whole. ZFS's approach comes from the fact that it was bolted onto Solaris and had to solve many of these problems on its own.

In addition to write combining in Linux's block layer, the previously mentioned block caching layer added to the kernel allows for a write-back or write-through cache. This cache is fundamentally different to the ZIL because it actually caches the block data that is destined for the filesystem. This data can be kept in cache for instances where recently written data is re-read. This effectively optimizes out the behavior, which can lead to actual performance increases for things like NAS/SAN storage.

 

ZFS supports thin-provisioned virtual block devices

This is true: BTRFS does not expose extents of storage as block devices. I find this to be a pretty significant feature for ZFS when used to build a SAN. An administrator could easily create an MD raid device, place LVM on top of it, and expose the LVM logical volume to a client. This is a pretty significant workaround, though, and offers no protection from "bit rot".

 

ZFS helps you share

 Yes, ZFS will inform the NFS or Samba daemon of the new export. I find this to be an unnecessary "optimization" and a pollution of the zfs command. With things like AppArmor profiles, it's also unlikely that the zfs or btrfs command would be allowed to modify some other configuration file in /etc by default for security reasons.

At the end of the day, the NFS and CIFS daemons on a server have nothing to do with the filesystem below, so I find no compelling reason to teach those tools how to speak to a file server. This is such a shallow "win" that it seems like a real stretch.

 

ZFS can save you terabytes by deduplicating your data

Well, it could save you that much if you had several terabytes of the same exact data. Unfortunately, ZFS's deduplication performance is abhorrent and uses so much RAM that it's foolish. To actually have 1TB of de-duplicated data would require between 2.5 and 640 GIGABYTES of RAM for the tables in ZFS. This is memory that is not available to your OS or to the filesystem for caching. It is simply a lookup table!
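Both ends of that range fall out of simple arithmetic if you assume roughly 320 bytes of dedup-table memory per unique block. That is a commonly quoted rule of thumb and my assumption here, not a number taken from the ZFS source. The function name is made up.

```python
DDT_ENTRY_BYTES = 320  # assumed in-memory cost per unique block (rule of thumb)
TIB = 2**40
GIB = 2**30

def ddt_ram_gib(unique_data_bytes, block_size):
    """Estimate dedup-table RAM in GiB for a given amount of unique data."""
    blocks = unique_data_bytes // block_size
    return blocks * DDT_ENTRY_BYTES / GIB

print(ddt_ram_gib(TIB, 128 * 1024))  # 128 KiB blocks -> 2.5
print(ddt_ram_gib(TIB, 512))         # 512-byte blocks -> 640.0
```

The block size is what swings the estimate: large blocks mean few table entries, while small blocks (think zvols or databases) blow the table up by orders of magnitude.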

Worse yet, the CPU requirements for ZFS's online (real-time) deduplication are considerable. Many people building a NAS or SAN device purchase lower-end CPUs because the only thing they will be used for is checksum calculation and compression. Adding the complexity of the de-duplication lookups adds real CPU requirements to the host's cost and tends to add quite a bit of latency in the real world applications I've had.

This is not just a ZFS issue, though. I've also used online and offline deduplication on WAFL (NetApp) filesystems and on Windows Server hosts. Any time you enable deduplication, write speed decreases significantly.

BTRFS does offer deduplication, though. It has offered one type for quite some time: creating copies of files that are reflinks to the original creates a Copy-on-Write file whose only consumed blocks are those that are changed from the original. Additionally, BTRFS offers an off-line deduplication system, which will deduplicate files when the command is executed. This offers normal write speeds, while still allowing frequently duplicated data to be minimized on a schedule. I have attempted this with NAS/SAN appliances that served as a store for backups and virtual machines, and I can say it's usually not a feature anybody would want to use.

In the case of a backup storage system, deduplication is often performed by executing incremental backups, saving network bandwidth and storage at the same time. This has a much higher dividend than just de-duplicating the storage system! In cases where virtual machine images are clones of a "golden master", I find cp --reflink to be more than adequate. This deduplicates the common operating system storage among the images and doesn't require any processing overhead or maintenance job.
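A sketch of the golden-master workflow, in Python for illustration: it builds one "cp --reflink=always" command per clone and does not run them. The image paths and VM names are hypothetical. I use --reflink=always here so that cp fails instead of quietly making a full copy on a filesystem without reflink support.

```python
def reflink_clone_commands(golden_image, clone_dir, vm_names):
    """Build (but do not run) one reflink copy of the golden image per VM."""
    commands = []
    for name in vm_names:
        target = f"{clone_dir}/{name}.img"
        # Each clone shares the master's blocks until the VM writes to them.
        commands.append(["cp", "--reflink=always", golden_image, target])
    return commands

for cmd in reflink_clone_commands("/mnt/btrfs/images/golden.img",
                                  "/mnt/btrfs/images", ["web1", "web2"]):
    print(" ".join(cmd))
```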

 

 

So, in conclusion, this article was mostly FUD or ignorance. I suspect it wasn't intentional. Most people find themselves defending a particular position vehemently without realizing it, and often for no other reason than it was the choice they had made.

BTRFS is a newer filesystem than ZFS with a significantly different design and some common features. Comparing filesystems is a natural consequence of having so many choices, but I think it's important to point out the fact that ZFS was designed to fix some of the shortcomings found in Solaris at the time, whereas BTRFS's developers haven't had to face many of those challenges.

I have no doubt that BTRFS will mature quickly over the next year or two, and will provide Linux with a first-class, checksum-enabled, CoW filesystem that can be used in production without question. In my next entry, I will try to compare ZFS and BTRFS's designs and implementation details. Hopefully this will show where the two have common ground, explain some of the user-land differences, and highlight where each could borrow from the other to improve.
