Entries in Linux (4)


My case for BTRFS over ZFS

The computer industry always finds itself mired in FUD and heated debate over what can only be described as religion. The ZFS and BTRFS camps are clearly no different, so I figured I'd give some actual real world experience from the perspective of a Systems Engineer that has used both with great success.


First, let's address this site point-by-point. First off, my impression is that this comes from a person with very little understanding of BTRFS or the fact that it is a filesystem under active development whereas ZFS is stable and mature. That should be anybody's first bullet point, BTW. ZFS has been deployed in production environments for a long while with great success, while BTRFS is only now gaining real traction.


ZFS organizes file systems as a flexible tree

The complaint here seems to be the a snapshot of a subvolume is logically "located" beneath that volume in the filesystem tree. This is probably a consequence of design (b-tree and all) than anything else. But, since a subvolume is still a volume, the administrator could mount it anywhere they like. Including in a mountpoint that is not beneath the root node of BTRFS.


File system operations in ZFS can apply recursively

This is an intentional design decision and calling it out as a short-coming of BTRFS actually shows a lack of understanding of the filesystem's design principles and system administration in general. Typically, you want a command to operate on as small a set of data as possible, and the administrator handles the recursion with something like find or for. A matter of opinion, perhaps, but historical convention argues that ZFS does it wrong in this case.

The actual design here is that BTRFS's subvolumes are actual b-trees on their own. So, an operation that accesses root/vol1 would need to access an entirely different node to apply to root/vol2. When using BTRFS, it is best to think of subvols as actual seperate filesystems. They are stored in the same pool, but they are logically seperated from other subvolumes.


Policy set on ZFS file systems is inherited by their children

Again, because of the fact that subvolumes are trees unto themselves, this stands to reason. It's a consequence of the design. However, it doesn't mean it's not possible. For instance, options such as compression are applied to subvolumes when the parent volume is mounted.

Of course, with an IOCTL these options can be enabled or disabled on a per-item basis, so it really isn't a point worth considering.


ZFS auto-mounts file systems by default

This is actually a lack of understanding by the author. There is a conf file that informs ZFS of the ZFS filesystems that exist, and that they should be mounted when the first ZFS filesystem is mounted. To be clear, ZFS has actually replaced the behavior or /etc/fstab with its own config file. Even worse, the administrator can still use the fstab config file! This is fundamentally broken, IMO.

BTRFS, and all other filesystems with reasonable behavior, require an administrator to mount the filesystem. Of course, with an automount daemon, I'm sure it would be extremely trivial for any administrator to replicate this behavior.

Note, however, if a child subvol is created within a BTRFS subvol it will appear in the mounted filesystem. In other words, if I have a btrfs volume "/mnt/btrfs" and I create subvol "/mnt/btrfs/sub", when I mount /mnt/btrfs the subvolume /mnt/btrfs/sub appears in the tree.

The concern about changing mountpoints of subvols is only half true, since a subvol can be mounted to any location by passing the subvol= option to mount.


ZFS tracks used space per file system

This is actually quite confusing at first. ZFS shows you quite clearly how much space us being consumed by each ZFS mounted. BTRFS shows how much space is consumed in the entire pool. The du command works as expected, whereas df shows the raw free space. This can make life difficult for the administrator, because they must understand the semantics of their data protection strategy to calculate how much space can actually be allocated. There is a patch for this, but it's a serious pain for newcomers to BTRFS. This is a clear usability win for ZFS.

Because each item in BTRFS can have a different set of options applied to it, though, it does start to make a small bit of sense. Any file can (in the future) have an IOCTL call set its data protection and compression options, so predicting the amount of free space that is actually available for allocation given that detail would be very difficult.

This is also an interesting design question with room for discussion. Since storage in ZFS and BTRFS is actually a pool, should the administrator see the pool's view of storage, or the individual volume's view by default? Clearly, there should be an option to view each, but what should the default view be? I believe ZFS does it properly by showing the volume's values.


ZFS distinguishes snapshots from file systems

This seems to be an issue with the author's experience with BTRFS. A snapshot does not have to be a peer with its original. Snapshots, like any subvolume, can be given a destination. Personally, I like to create a subvolume that will act as a target for snapshots and create snapshots with a destination of that particular volume. In other words, I will have subvolume "/mnt/btrfs/snapshots". I will then create a snashot of "/mnt/btrfs/somevol" to "/mnt/btrfs/snapshots/somevol". In reality, I like to prepend or append the Unix timestamp of when the snapshot was taken as well, but the point is snapshots can be created in some other volume if the administrator chooses.


ZFS lets you specify compression and other properties per file system subtree

BTRFS allows these options to be specified on a per-file basis.


ZFS is more stable

This is true, as long as your definition of stable relates to deployment in production systems and code age. Since my definition of stable is exactly that, I agree. BTRFS is much newer, and because of this and the existence of the time continuum, ZFS is more mature. However, we should all agree the sometimes new products get to benefit from the discoveries made over time. New is often better than old. This is certainly true in computers.

Given the same time that ZFS was granted, I have absolutely no doubt that BTRFS will be as stable and mature if not more. Because BTRFS was developed as an open source project from day one, it has the anecdotal advantage of more contributors and testers than Sun allowed when creating ZFS. I say anecdotal because there's no hard data that I would be able to dig up to prove this one way or the other. Sun created ZFS behind closed doors, so its extremely early history is pretty much lost.



Nobody should be using RAID-Z. Period. As a user of ZFS, this was the very first item drilled into my head. Using RAID-Z is effectively equivalent to using RAID-5. RAID-Z2 is the rough equivalent of RAID-6, and there is even RAID-Z3. However, every single date reliability study ever done has noted that using duplicates provides far superior protection. Since BTRFS and ZFS advertise their main selling points as data reliability, why on earth would you ever choose the cheap-and-dirty way out?!

Even more important is performance. Using RAID-5 or RAID-6 is far slower than using RAID-10. This is true when using RAID-Z instead of a RAID-10 style protection scheme in ZFS, or when using RAID-10 protection instead of RAID-5 protection in BTRFS. By the way, RAID-5 and RAID-6 are available in BTRFS. Their existence was intended from the early development days, but it simply wasn't a real development priority for BTRFS since most consumers are using RAID-10 anyway. 

PLEASE don't use RAID-Z or RAID-5 for data protection.


ZFS has send and receive

BTRFS has had send and receive for quite some time at this point. It was one of the first "extra" features added, and it ceratinly existed before BTRFS was considered a stable format.

As far as ease of use, "btrfs send" is pretty equivalent to "zfs send".


ZFS is better documented

This is very subjective. Much of the great ZFS documentation has almost entirely disappeared after the oracle acquisition. Also, much of it is very very old. In contrast, the "btrfs" command is very well documented much like the "zfs" and "zpool" commands are with man pages and fantastic "--help" results.

On the other hand, technical documentation for BTRFS abounds. The Wikipedia page details its design very well, and the BTRFS page at kernel.org is a wonderful central repository of information for users and administrators alike. Again, this is subjective because as a systems engineer I am interested in details that most users and some administrators are not interested in. With regard to the information easily available to an administrator, I find no appreciable difference.


ZFS uses atomic writes and barriers

It is important to understand that the author is incorrect about barriers and atomic filesystem transactions, but correct that ZFS uses atomic writes and write barriers. Barriers do not mean data will not be lost or that I/O transactions are atomic. Instead, a barrier is a way to ensure, as the author noted, the order of writes reaching durable media. By ensuring the order of certain writes, you can provide some guarantees about data durability.

However, the author makes a common incorrect claim: "you will never lose a single byte of anything committed to the disk". This is absolutely incorrect. Instead, the claims made by journaled, barrier-write enabled filesystems is that you will not lose anything before the previously committed barrier-written transaction. If you yank power to a host that is writing data to durable media, you absolutely will lose data. The guarantee here is that you will only lose certain data, and should never be left in a position where a filesystem is corrupted beyond repair.

Understanding these facts is critical to a system administrator, and I have seen this incorrect statement far too often.

With regard to BTRFS, it also uses write barriers to make guarantees about its data and metadata. I'm not sure why the author was convinced that it didn't, but that's is the power of FUD I suppose.


ZFS will actually tell you what went bad in no uncertain terms, and help you fix it

It seems the author had no idea that "btrfs device stats" existed. The output of that command, when pointed to a btrfs mountpoint, is a list of devices that compose the pool and counters for several kinds of errors and corruption types.

Furthermore, "btrfs scrub status" will give you the results of the last scrub operation. This is very similar to ZFS, so I'm not sure why the confusion. zpool does give some interesting stats about the pool being interogated, but I personally don't need I/O stats re-implemented in a filesystem-specific way. On Linux, I have /proc and /sys. Those two tools alone replace much of what a system engineer needs from the zfs and zpool commands, so this is a wash to me or a win for BTRFS.


ZFS increases random read performance with advanced memory and disk caches

No, it does not. I've used ZFS in several scenarios that required differing I/O patterns to behave differently. I've used ARC, L2ARC on SSD. What I can say, without question is the following:

ARC is a terrible replacement for the VFS caching layer in Linux, because VFS has no idea what ARC is. More importantly, RAM allocated to ARC is memory that is considered actively used whereas almost all other filesystem caches will purge less recently used data when the operating system requires it for some other operation. This is a good design, because I'd rather the OOM killer not come by and kill a critical task when I have several gigabytes of data in RAM that could be re-read from disk. It's pretty rare that a NAS/SAN would require a massive, active working set in RAM, so I find the VFS caching layer to be adequate as it is now.

L2ARC is, to me, the recognition that adding faster disks (SSD) is easier than adding RAM to a host. L2ARC caches more frequently used blocks to a fast bit of storage so that when the data is requested again and they aren't cached in ARC, they will be read from L2ARC faster than the backing spindles could produce them. I've had very good experience with FusionIO cards used as L2ARC. Linux has an answer for this in recent kernels, spurred by the request that a caching block device be made available by the BTRFS developers. The benefit on Linux is that instead of this being a filesystem specific addition, the OS itself will add this feature and all filesystems will benefit!

In practice I have found that L2ARC only helps a bit, because random access typically happens...wait for it...randomly! Unless it happens randomly against the same blocks, caching the data is a useless exercise that adds more I/O to the underlying pool of spinners. I can only imagine this kind of thing is added when a developer realizes their filesystem offers terrible performance for database loads, and they incorrectly presume they can solve the problem by adding an additional caching layer.


ZFS increases random and synchronous write performance with log devices

No. It does not. The ZIL offers very little in the way of performance enhancements. It is only applicable to sync writes, which are uncommon in the real world of NAS/SAN servers. Many people presume the ZIL automagically combines random writes into larger sequential writes. Again, this is only true when the writes are synchronous and happen within the flush barrier period (default of 10 seconds).

There is an answer to this in Linux- The block I/O scheduler does some basic write combining before flushing I/O to the block devices below. Again, the Linux aproach is different in that if an idea is good, it should be applied to the system as a whole. ZFS's aproach comes from the fact that it was bolted onto Solaris and had to solve many of these problems on its own.

In addition to write combining in Linux's block layer, the previously mentioned block caching layer added to the kernel allows for a write-back or write-through cache. This cache is fundamentally different to the ZIL because it actually caches the block data that is destined for the filesystem. This data can be kept in cache for instances where recently written data is re-read. This effectively optimizes out the behavior, which can lead to actual performance increases for things like NAS/SAN storage.


ZFS supports thin-provisioned virtual block devices

This is true- BTRFS does not expose extents of storage as block devices. I find this to be a pretty significant feature for ZFS when used to build a SAN. An administrator could easily create an MD raid device, place LVM on top of it, and expose the LVM logical volume to a client. This is a pretty significant work around, though, and offers no protection from "bit rot". 


ZFS helps you share

 Yes, ZFS will inform the NFS or Samba daemon of the new export. I find this to be an unnecessary "optimization" and a pollution of the zfs command. With things like AppArmor profiles, it's also unlikely that the zfs or btrfs command would be allowed to modify some other configuration file in /etc by default for security reasons.

At the end of the day, the NFS and CIFS daemons on a server have nothing to do with the filesystem below, so I find no compelling reason to teach those tools how to speak to a file server. This is such a shallow "win" that it seems like a real stretch.


ZFS can save you terabytes by deduplicating your data

Well, it could save you that much if you had several terabytes of the same exact data. Unfortunately, ZFS's deduplication performance is abhorrent and uses so much RAM that it's foolish. To actually have 1TB of de-duplicated data would require between 2.5 and 640 GIGABYTES of RAM for the tables in ZFS. This is memory that is not available to your OS or to the filesystem for caching. It is simply a lookup table!

Worse yet, the CPU requirements for ZFS' online (real-time) deduplication is considerable. Many people building a NAS or SAN device purchase lower-end CPU because the only thing it will be used for is checksum calculation and compression. Adding the complexity of the de-duplication lookups adds real CPU requirements to the host's cost and tends to add quite a bit of latency in real world applications I've had.

This is not just a ZFS issue, though. I've also used online and offline deduplication on WAFL (NetApp) filesystems and on Windows Server hosts. Any time you enable deduplication, write speed decreases significantly.

BTRFS does offer deduplication, though. It has offered one type for quite some time- creating copies of files that are reflinks to the original creates a Copy-on-Write file whose only consumed blocks are those that are changed from the original. Additionally, BTRFS offers an off-line deduplication system, which will deduplicate files when the command is executed. This offers normal write speeds, while still allowing frequently duplicated data to be minimized on a schedule. I have attempted this with NAS/SAN appliances that served as a store for backups and virtual machines, and I can say it's usually not a feature anybody would want to use.

In the case of a backup storage system, deduplication is often performed by executing incremental backups. Saving network bandwidth and storage at the same time. This has a much higher dividend than just de-duplicating the storage system! In cases where virtual machine images are clones of a "golden master", I find cp --reflink to be more than adequate. This deduplicates the common operating system storage among the images and doesn't require any processing overhead or maintenance job.



So, in conclusion, this article was mostly fud or ignorance. I suspect it wasn't intentional, most people find themselves defending a particular position vehimently without realizing it and often for no other reason than it was the choice they had made.

BTRFS is a newer filesystem than ZFS with a significantly different design and some common features. Comparing filesystems is a natural consequence of having so many choices, but I think it's important to point out the fact that ZFS was designed to fix some of the shortcomings found in Solaris at the time, whereas BTRFS's developers haven't had to face many of those challenges.

I have no doubt that BTRFS will mature quickly over the next year or two, and will provide Linux with a first class checksum enabled, CoW filesystem that can be used in production without question. In my next entry, I will try to compare ZFS and BTRFS's designs and implementation details. Hopefully this will show where the two have common ground, explain some of the user-land differences, and hilight where each could borrow from the other to improve.


Fully enabling discard

To those of you using a Linux kernel greater than 2.6.33 and an SSD, you should be enabling discard support. Enabling discard will tell the underlying device to enable TRIM, UNMAP, or other physical block erase command. In general, using discard on your SSD will cause empty cells to be zeroed out. This usually happens in the background so they are clear when the cell is re-written at a later date. This saves you several operations at write time, which equates to somewhere around a 3x write speedup on an aged SSD.


Filesystem Discard

If you're using the EXT4 filesystem, simply add 'discard' to your mount options in /etc/fstab.

/dev/mapper/ssd-root / ext4 discard,errors=remount-ro 0 1

When the device is mounted again, discards will be enabled. If you don't want to reboot, you can always make the necessary changes and then execute:

$ mount -o remount


The discard option will only work with the ext4 filesystem and not ext2 or ext3. If you add the discard option to an ext2/3 fstab entry, the filesystem will be mounted read-only which will prevent proper booting in some cases!

An alternative to enabling TRIM on ext4 is to run BTRFS, which also replaces LVM or other volume managers. Discard is enabled by default in most cases when mounting a BTRFS volume.



As of kernel version 2.6.38, LVM2 supports enabling discard. Enabling it will cause LVM to issue a discard command when a logical volume is no longer using a physical volume's space. Documentation isn't very explicit about this support regarding filesystems releasing blocks, but at the very least when an lvreduce or lvremove is executed, the physical device will be sent the appropriate discard command.

To enable discard support in LVM2, simply change the issue_discards value in the /etc/lvm/lvm.conf file.

devices {


    issue_discards = 1




DM-Crypt Discard

If you are using Linux kernel version 3.1 or later, you may enable discard for encrypted block devices. Cryptsetup allows you to specify a discard option on the command line when creating an encrypted device, but you may also enable it by adding the 'discard' option to /etc/crypttab entries.

sda3_crypt UUID=a73246e5-1337-dead-beef-abcdef543210 none luks,discard

It has been mentioned that using the discard command with an SSD may lead to data vulnerability. The theory is that if discarded data can be recovered from the physical device, information about device free and used space, device size, and filesystem type could be exposed. For the most part, on an SSD, a discard command becomes a TRIM, which actually erases the storage cell. On a thinly provisioned device, such as a SAN, these blocks may not actually be zeroed. I think it's unlikely that exposing information about the kind of filesystem would risk actually encrypted data.


The build begins

At work, I've recently been facing a two-pronged problem with our server infrastructure. It's been weighing on my mind for a while, and recently I decided that something had to be done. I wasn't sure what I would end up doing, but I needed to act carefully and quickly. These aren't usually two things that go hand-in-hand when planning out IT infrastructure whose life expectency needs to be five or more years!

The first prong of our problem was storage space. To be more specific, we're running out of it. We have a SAN at our HQ location that we use to host two VMware vSphere 4.x instances, along with data for our network monitoring system (OpenNMS) and a few other odds and ends. I calculated that within six months we would be at 100% of the capacity of that SAN without adding any new projects that required storage space. The problem here is that there's a project currently in the works that requires not only the rest of the space in this SAN, but an additional 3x its original capacity! To make matters that much worse, when I projected storage needs out to 18 months, we were beyond the maximum capacity of the SAN in question. If I were to band-aid this problem, it would cost around $10,000 now, and be completely useless in just over a year. So, while $10,000 is an absolute steal when it comes to adding the capacity I wanted to a SAN, I felt like I couldn't justify a $10,000 hit for 12 months of useful service. There's so much more I'd rather do with that money!


Our second piece of the problem was parformance in another realm. We are already running up against the parformance capabilities of the SAN we were looking to upgrade! So this meant that while spending the money to upgrade its data capacity would solve one problem, it would actually make the second problem worse! And the more we rely on this SAN, the worse the performance problem gets. In fact, when I charted out the projected performace of the existing SAN over the next year, it would be so over utilized that everything we ask it to do would take more than 24 hours to complete. This is a pretty severe problem when your business hours are 7am to 7pm, needless to say.


So I went back to the drawing board. I puzzled over several possible solutions, but none of them really made me happy. I could spend $20,000 on a second SAN and use the two of them in parallel, I could steal some capacity from our backup SAN for our customer data, I could just add capacity to the existing SAN and hope the performance calculations were wrong, and of course I could do nothing. The last option was by FAR the best option on the table at the time.


Then it dawned on me. I had experimented with some different servers here, building a "home grown" SAN. Servers were cheap, and I could use any brand of hard drive I wanted. I wasn't locked into a single vendor, and the price tag was extremely appealing. For $10,000 you can build a Linux (or Windows) server with so many disks in it that you actually need specialy hardware to handle them all! I had aready been mulling over the idea of building a small Linux SAN for work, so why not build a massive one instead? I mulled this over for months until the opportunity presented itself. And by that I mean my back was against the storage wall, and a decision had to be made in just a few weeks.


Well, that day has come! While ordering software for my company, I negotiated a deal that saved around $3,500. This afforded me the opportunity to build a Linux server with room for 42 disk drives, a second CPU, more RAM than you could shake a stick at, and all the bells and whistles I wanted. This new system will give me the ability to expand throughput by adding much faster network cards and disk controllers, while allowing me to add affordable storage any time I want! What's better is that I will be modeling my system after ones used at CERN and LLNL- two massive agencies that have some serious storage needs.


Finally, after months of planning and research, the order has been placed and the parts are on their way. I'll be using 20 1TB hard drives, two or four 60GB SSD drives, 6GB of RAM (initially), a single Intel E5620 Xeon CPU, a 10Gbit ethernet card, and an Areca 24-port 6G SAS card in JBOD mode. The plan is to build a ZFS pool out of the disks, tune it to use the SSDs for data access acceleration, turn on the data de-duplication features, and benchmark the HELL out of it. Once I'm happy with the results, I'll be carving storage out of the pool and migrating things off the old SAN. Once everything is moved off the old SAN, I will add it to the ZFS pool as slower storage used for less frequently accessed data. As more storage is needed, the system will expand with the addition of $1,000 external disk chassis that can hold up to 24 disks each. The total disk limit of a single Areca card is 128 disks, and the system will be physically limited to two cards. I think 256 3+TB drives would be more than enough space to hold us over for the forseeable future.


I plan to keep good notes during this process, and I'm hoping to update this blog with the details as I make progress. I'll also be using the Phoronix benchmark suite to test access to the ZFS pool on the server itself and on iSCSI, CIFS, NFS, and AoE clients, so be prepared for lots of boring numbers.


Upcoming Ubuntu 9.04 release

Canonical and the Ubuntu community will release the next version of Ubuntu Linux- codenamed Jaunty Jackalope. This new version will be coming with many new features, and quite a bit of polish on the surface. Some of these changes will cause issues for some people running in certain hardware situations, such as those that require the use of third-party drivers. Many vendors do work hard to get drivers available as soon as possible, and several are already packaged up with the testing releases, but there are still those that drag their heels.

New Features

  • Package Updates - As with all distribution version updates, there are a host of packages that get version updates. These range from common utilities to linked libraries. For details about updated packages, take a look at the jaunty-changes mailing list at https://lists.ubuntu.com/mailman/listinfo/jaunty-changes.
  • X.Org Server 1.6 - X.org is the underlying graphical system of Ubunbu Linux. Version 1.6 now ships with DRI2 to manage graphics rendering (including 3D), X Input 1.5 which allows automatic detection and configuration of input devices such as mice, keyboards, pens, touch screens, etc., A new pointer acceleration system, and RandR 1.3, which manages the resizing and rotation of screens. In addition to these updates, there are several smaller but no less critical changes. Many of these increase performance and stability across the board, and provide driver updates to more fully support new hardware versions.
  • Optimized font sizes - Ubuntu 9.04 will detect your display settings and automatically set your font's dot-per-inch size. This will make text more easily readable and consistently sixed across an array of devices. If you prefer a customized side, you can also manually set this value.
  • New notifications - I am personally excited about this feature. Ubuntu now offers a standardized notification framework to other applications, making notification messages appear and behave in a predictable way. To see a preview of these notifications, please check out this example on Mark Shuttleworth's blog.
  • Updated kernel - Jaunty will ship with kernel version 2.6.28. This kernel version comes with a long list of newly supported devices, EXT4 filesystem which makes disk access in most cases faster and more reliable, and Intel's new Graphics Execution Manager (GEM), which provides a new system for managing the memory of graphics systems. Naturally, quite a bit of work went into fixing bugs, updating existing drivers, and optimizing performance.
  • EXT4 filesystem - The EXT4 filesystem is an update to the EXT3 filesystem, but also so much more. EXT4 has been discussed at length on every Linux forum and benchmarking site, but the breakdown is this. Files are allocated in a new way, which makes creating and deleting significantly faster. This new allocation method also requires less data to describe files themselves. In addition, data is cataloged with 64bit addresses, which means the amount of usable storage is orders of magnitude larger than the number of devices you can attach to your system.

Unfortunately, a few features were cut. Most significant to me, personally, was the encrypted home directory. The basic rundown of this feature was that when you installed Ubuntu, it would ask you for a password and then create an encrypted directory in your home folder. This directory could be used to store sensitive documents, data, and things of the like. If your disk or computer were ever stolen, you didn't have to worry about that information getting into the wrong hands- without your password even the NSA wouldn't be able to access that data. Due to some "outstanding issues", the feature has been removed from the testing releases and will not be put into the final version for download.

In all, Ubuntu has been shaping up very nicely over the past few years that I've been using it. At this point, the system is completely ready for use by any computer user, and that is a claim I will stake my reputation on. There are very few situations where Ubuntu can not fully satisfy any need a computer user has, but just like switching from a PC to a Mac, you need to be ready to change the way you think about using computers.

If you haven't tried Ubuntu yet, I encourage you to head over to www.ubuntu.com, download a copy, and boot up a live-cd. This will allow you to try Ubuntu without installing it on your PC. If you like it, I say back up your data and take the plunge. If you don't, I'd love to hear what you feel is wrong with it.