
Day 2007.10.07

You don’t have to hate ZFS to know it’s wrong for you

In a well-considered blog post, file system engineer Drew Thaler takes issue with our previous post blasting AppleInsider for touting ZFS, without understanding file systems, as the cure for all storage problems, real and imagined. That report was the same kind of uninformed tripe that makes people who know nothing about file systems crave ZFS and believe they’re somehow “cheated” by being “stuck” with HFS Plus.

Thaler knows his stuff, but in mistaking our disdain for ZFS rumors as “ZFS hate,” he minimizes the real and significant problems that this advanced file system would bring to today’s Macintosh computers. Of course, part of the problem is that the post is an abbreviated argument, not our entire case.

At the moment, subscribing to the free trial of MWJ gets you a full copy of MWJ 2007.06.11, with more than sixteen detailed pages on why ZFS is no replacement now or in the foreseeable future for HFS Plus. We stand by this article in its entirety.

We don’t hate ZFS. It’s a remarkable and advanced file system, with a lot of concepts that make plenty of sense for 20TB server storage. These same concepts make absolutely zero sense in today’s Macs. Let’s explore the complaints one by one, and look at HFS Plus.

ZFS is great specifically because it takes two things that modern computers tend to have a surplus of — CPU time and hard disk space — and borrows a bit of it in the name of data integrity and ease of use.

MWJ 2007.06.11:

Processes like background storage pool scrubbing are probably fine for stand-alone servers, but additional compression, checksumming, scrubbing, mirroring, fatzap tracing, and copy-on-write time could be a killer to someone trying to encode real-time video effects on a MacBook Pro. A new “default” file system that takes more disk space _and_ more CPU time to store the same data is not a win for a portable computer–and in case you hadn’t noticed, Apple’s MacBook sales continue to outpace desktop system sales by a margin that only grows by the quarter. As one staffer noted to [MWJ], Mac users already complain about CPU and disk performance attributed to Spotlight. Imagine adding that to everything.

Drew Thaler’s post:

HFS Plus administration is simple, but it’s also inflexible. It locks you into what I call the “big floppy” model of storage. This only gets more and more painful as disks get bigger and bigger. ZFS administration lets you do things that HFS Plus can only dream of.

And later on:

How dumb do you need to be to willfully ignore the fact that the things that are bleeding-edge today will be commonplace tomorrow? Twenty gigabyte disks were massive server arrays ten years ago. Today I use a hard drive ten times bigger than that just to watch TV.

First, we didn’t say “20GB disks” were “massive server arrays.” We said “20TB”, or twenty terabytes. We’ll stand by that.

Second, we don’t think HFS Plus “dreams” of shorter battery life and requiring more disk space to store the same data. ZFS handles very large storage pools efficiently, in ways that standard block-based file systems can’t. At some point, that may become a great trade-off. Today, when people still stupidly “clean” their apps of alternate processor architecture code or foreign language support to save disk space, it would be a customer relations nightmare.

The fatzap storage structure requires an additional 128KB of disk space for any attribute that’s not part of the basic ZFS model, including most of those that Mac OS X software uses. MWJ 2007.06.11 says:

A standard Mac OS X installation can easily include 600,000 separate files. If just 20% of them have attributes that require a fatzap object, ZFS would require 120,000 fatzap objects at 128KB each, or more than 15GB of additional disk space above what an HFS Plus disk would require for the same data storage.
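The arithmetic in that passage is easy to check. Here is a short sketch; the 600,000-file count and the 20% attribute rate are MWJ’s estimates, not measured values, and the 128KB figure is the fatzap header size cited above:

```python
# Checking the fatzap-overhead arithmetic quoted above. The file count
# and the 20% rate are MWJ's estimates, not measured values.

FATZAP_HEADER_BYTES = 128 * 1024      # 128KB header per fatzap object

total_files = 600_000                 # typical Mac OS X installation (MWJ's figure)
fraction_needing_fatzap = 0.20        # files with attributes outside the ZFS model

fatzap_objects = int(total_files * fraction_needing_fatzap)
overhead_bytes = fatzap_objects * FATZAP_HEADER_BYTES

print(f"{fatzap_objects:,} fatzap objects")
print(f"{overhead_bytes / 1e9:.1f} GB of overhead")   # prints "15.7 GB of overhead"
```

That 15.7GB is pure structural overhead, before a single byte of the attributes themselves is stored.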

Apple will wind up having sold more than seven million Macs in fiscal year 2007. Do you want to answer the phones and E-mail when 6.7 million of those owners want an explanation for why they had to give up 15GB or more of their hard disk to support a file system that geeks with blogs say will be really cool several years after you’ve upgraded to a completely new computer?

Yeah, today’s hard disks are 500GB each for a reasonable price. Just six years ago, a state-of-the-art PowerBook G4 had a 20GB hard drive. How many people purchasing a 667MHz PowerBook G4 in 2001 would have been willing to give up 600MB or more of hard disk space to support a file system that might pay dividends six years later? We’d venture “not many.” And HFS Plus was already nearly four years old by that point.

As for the “big floppy” model, well, Thaler doesn’t explain what he means, but we’re guessing he means that since each disk can only be divided into so many allocation blocks, then as volumes get bigger, each allocation block gets bigger. This was the major flaw in HFS for its day—by the time disks started getting big, every allocation block was taking huge amounts of disk space, and on average, half of the last block of each file is wasted.
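If that reading is right, the scaling is simple to illustrate. The sketch below models original HFS’s 16-bit limit of 65,536 allocation blocks per volume; it’s a simplification of the real rounding rules, but the proportions hold:

```python
# The "big floppy" problem in original HFS (not HFS Plus): with at most
# 65,536 allocation blocks per volume, block size grows with volume size,
# and on average half of each file's last block is wasted.

MAX_HFS_BLOCKS = 65_536  # 16-bit allocation block numbers in HFS

def hfs_block_size(volume_bytes: int) -> int:
    """Smallest multiple of 512 bytes that fits the volume into
    65,536 allocation blocks (simplified rounding)."""
    size = 512
    while volume_bytes / size > MAX_HFS_BLOCKS:
        size += 512
    return size

for gb in (1, 2, 4):
    bs = hfs_block_size(gb * 2**30)
    print(f"{gb}GB volume: {bs // 1024}KB blocks, ~{bs // 2048}KB wasted per file")
```

HFS Plus fixed exactly this by moving to 32-bit allocation block numbers, which is why small files on today’s volumes no longer waste space this way.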

If that’s the problem, we think it’s pretty ridiculous to be arguing that ZFS can use smaller allocation blocks on larger file systems while, simultaneously, ignoring that the fatzap system requires spending 128KB for every arbitrary attribute not used by Solaris. Again, from MWJ 2007.06.11:

First, “128-bit file system” does not mean 128-bit block numbers–it means you need 128 bits to express the largest possible data size. Yet if you’re going to go to the theoretical, an HFS Plus volume can have 2^32 allocation blocks, each of which could theoretically hold as many as 2^32 bytes (that’s 4GB per allocation block), for a total of 2^64 bytes, or 16 exbibytes. According to Wikipedia, that’s the maximum size of any file system in ZFS anyway! “ZFS” has nothing to do with zettabytes, and the “128-bit file system” holds files exactly as big as HFS Plus can hold. (In fact, ZFS developer Jeff Bonwick admits that he picked the name “ZFS” first and then retrofitted it to stand for something, but now they say it doesn’t stand for anything since it really has nothing to do with zettabytes.)
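For the record, the quoted arithmetic checks out in two lines (2^60 bytes per exbibyte):

```python
# HFS Plus: 2^32 allocation blocks, each theoretically up to 2^32 bytes.
hfs_plus_max_bytes = 2**32 * 2**32
assert hfs_plus_max_bytes == 2**64          # the same 2^64-byte ceiling

print(hfs_plus_max_bytes // 2**60, "EiB")   # prints "16 EiB"
```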

This is what happens when people read slogans and mistake them for features. There’s a lot of innovation in ZFS, and it’s true that ZFS storage pools can literally hold one quintillion times more data than an HFS Plus file system can hold. If you’re absolutely convinced that the file system decisions you make today will be unchangeable in the next two decades, you probably have to care about that. No one else does.

Thaler says:

Sure, RAID-Z helps a lot by storing an error-correcting code. But even without RAID-Z, simply recognizing that the data is bad gets you well down the road to recovering from an error — depending on the exact nature of the problem, a simple retry loop can in fact get you the right data the second or third time. And as soon as you know there is a problem you can mark the block as bad and aggressively copy it elsewhere to preserve it. I suppose the author would prefer that the filesystem silently returned bad data?

No—the authors would prefer that Sun’s marketing department and ZFS fans stop deliberately conflating checksums with error recovery. Without RAID-Z, ZFS can only tell you that the data is bad. That is useful and significant, but remember what AppleInsider’s report actually said, quoting Sun’s marketing material:

“A scrub traverses the entire storage pool to read every copy of every block, validate it against its 256-bit checksum, and repair it if necessary,” the description reads. “All this happens while the storage pool is live and in use.”

ZFS can try to repair non-RAID data when it detects a bad checksum, and sometimes repeated reads can make that work. But the description and article say that ZFS does repair bad data without pointing out that to guarantee this functionality, you must use RAID to keep extra copies of the data. The description above needs big letters that say “attempts” and “sometimes,” and they’re almost always missing from ZFS advocacy.

Thaler’s opinions are well-founded, but we disagree with him on several important points. To wit:

  • “Hard disks … are building blocks that you can just drop in to add storage to your system. Partitioning, formatting, migrating data from old small drive to new big drive — these all go away.”

    True—but with current ZFS, you can never remove these building blocks. If you add an external hard drive to your MacBook’s ZFS pool, you must keep the external hard drive attached to the pool from then onward or else the filesystem is more or less destroyed. The only way around that is—wait for it—using RAID-Z so that the pool keeps at least one redundant copy of the data.

    This is not rocket science, folks: if you split a file system across multiple physical devices, and then remove one of those devices, then a big chunk of your data is offline. No modern OS is designed around the idea that random blocks from a file may not be available. ZFS partisans tend to omit that you can’t remove devices from the storage pool unless all the data on said device is duplicated elsewhere in the storage pool. That means if you add a hard disk to your portable computer’s ZFS storage pool, you have to keep that hard disk attached to it from then onward, or destroy and recreate the entire filesystem to get rid of it, which may not be possible without at least temporary use of even more hard disks.

    This is a fine trade-off for what ZFS does, but in our opinion, that kind of pain is not worth it for the average Mac user. This seductive goal of just plugging in drives and using them “magically” looks a lot different when you realize what you have to go through to unplug them, something Thaler, like most ZFS partisans, never even mentions. (It’s not entirely clear that he’s thought about the implications of using ZFS on portable computers very much at all.)

  • Snapshots “can eliminate entire classes of problems,” serving as a system-wide trash can metaphor and preventing problems with losing data.

    Everyone loves the ZFS snapshots, but no one seems willing to point out that they eat disk space like crazy. Sun is fond of saying that snapshots take up no disk space, but that’s only true when they’re created, just like an empty file. Again, MWJ 2007.06.11:

    Since ZFS writes new copies of data and leaves the old data alone and “orphaned,” then if you think about it, the storage pool now has both the old and new versions of the same file. Eventually, ZFS will reallocate the older blocks and re-use them, but the ZFS developers realized it doesn’t have to do that. ZFS supports the concept of a snapshot—an exact copy of an entire file system at the moment in time when you create the snapshot (“take the picture,” if you will). A snapshot is just another ZFS object, and when you create it, it indeed takes up almost no storage whatsoever.

    How can that be? At the moment you create a snapshot, its contents are exactly the same as the live file system. There’s no difference, so it takes no storage space to record the differences. As you continue to modify data on the file system, ZFS uses copy-on-write to allocate new storage blocks for the modified data, as you just read. When the new data is confirmed written to disk, ZFS normally frees the blocks held by the old data. However, if any snapshots are active, ZFS simply reassigns the old storage blocks to the snapshot instead of marking them as free space. A snapshot, therefore, holds the difference between the live data and the state of the data at the time you created the snapshot. It’s extremely clever and a great feature.

    It’s not “free,” though. Keeping a snapshot of a file system means that deleting a file does not actually free any storage space on the disks in the pool—ZFS has to keep all those data blocks around as part of the snapshot once you “delete” the file. In fact, deleting a file can actually result in ZFS using more disk space than before you deleted it, because in some cases, it has to write a new copy of the file’s directory so the snapshot can keep the old copy.

    (You may recall that, in theory, deleting a file on an HFS or HFS Plus disk can also consume more disk space because the OS has to re-balance the directory tree, and that may require more storage space in some cases–but in HFS and HFS Plus, deleting a file always frees up at least one allocation block, provided the file contains at least one byte of data, so it’s all but impossible for it to result in a net loss of disk space. In ZFS, with snapshots, deleting files consumes more disk space as a matter of course.)
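    A toy model makes the accounting concrete. This is a sketch of the concept only, not ZFS’s actual on-disk logic:

    ```python
    # Toy copy-on-write pool: a snapshot pins blocks, so deleting a
    # file afterward frees nothing. Illustrative only -- not ZFS code.

    class ToyPool:
        def __init__(self):
            self.live = {}        # filename -> set of block numbers
            self.pinned = set()   # blocks held by a snapshot
            self.next_block = 0

        def write(self, name, nblocks):
            # Copy-on-write: new data always gets fresh blocks.
            self.live[name] = set(range(self.next_block, self.next_block + nblocks))
            self.next_block += nblocks

        def take_snapshot(self):
            # Free at creation time: it merely pins the current blocks.
            for blocksks in ():
                pass
            for blocks in self.live.values():
                self.pinned |= blocks

        def delete(self, name):
            self.live.pop(name, None)   # pinned blocks stay allocated

        def used_blocks(self):
            in_use = set(self.pinned)
            for blocks in self.live.values():
                in_use |= blocks
            return len(in_use)

    pool = ToyPool()
    pool.write("movie.mov", 500)
    pool.take_snapshot()
    pool.delete("movie.mov")
    print(pool.used_blocks())   # still 500 -- the "deleted" file freed nothing
    ```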

    Later in the same issue:

    The biggest factor against ZFS as a primary Mac OS X file system is that it eats disk space like Philadelphia eats cheesesteaks. If you have even one snapshot in your storage pool, the storage pool will never use less disk space than when you created the snapshot. It will keep all the data from that snapshot, as well as the current version of the filesystem. If you take snapshots every hour, as Time Machine is reported to do, you’ll fill up your drive fast.

    You’ll also still take the performance hit to copy the data to an external backup drive unless the OS somehow makes your backup and your primary drive part of the same storage pool—and if it does that, you won’t be able to run the computer without the backup drive. That’s likely not what MacBook Pro owners had in mind. Also remember that the self-healing capabilities of ZFS come from mirrors in the storage pool—unless you add enough drives to use mirrors, then you can still lose data, and you can’t remove non-mirrored drives from the storage pool at this time.

    [MWJ] should note that ZFS on an external drive would have one Time Machine advantage: through copy-on-write and snapshots, ZFS would make it easier for Time Machine to keep multiple backups of large files. If you changed 8KB in the middle of a 2GB database, an HFS Plus-based Time Machine would want to copy the entire 2GB database again. A version implemented on top of ZFS might use snapshots to record only the 8KB that changed. While this should not be minimized, [MWJ] also notes that you’d need your Time Machine backup volume to be a separate ZFS storage pool from your boot drive—you can’t merge them into a single storage pool for “free” snapshots or else it’s not a backup drive.

    When Thaler says snapshots are “so cheap you wouldn’t possibly notice,” he is either being naive or disingenuous. Installing Leopard, for example, would take the full 5GB for Leopard’s estimated storage (not including the extra 15GB it might need thanks to ZFS’s inefficient fatzap storage for single non-Solaris attributes), on top of all the disk space you currently use for Tiger. Even if you delete a file, the disk space is never reclaimed while a snapshot holds it. To deny that this would eat disk space like crazy is simply incomprehensible.

  • ZFS is “designed to support Unix” with all its “subtleties,” while HFS Plus “was not really designed for that purpose.”

    It was designed with those concepts in mind, though. The problem with this statement is that ZFS was designed to support Solaris, and has some difficulties with file system attributes that Solaris doesn’t use. Again, MWJ 2007.06.11:

    [Extended attributes are] not a huge problem for Solaris—the original home of ZFS—because there, extended attributes are rare. Few files have them, and even fewer see them read or written frequently. In fact, although HFS Plus stores ACLs in extended attributes because it’s relatively fast and painless to access them, ZFS stores ACLs one level higher than attributes—at the same level where you find the pointer to the ZAP object for the file’s extended attribute—because opening the extended attributes ZAP object, reading the names, and finding the values in the fatzap would be far too slow for something that has to happen every time someone accesses the file.

    This unwieldy fatzap structure is what Mac OS X would have to use for every piece of file metadata that didn’t fit into the Unix metadata model, meaning there wasn’t already space allocated for it in the file’s catalog entry. File type and creator type can fit in a microzap object, but a resource fork cannot. Neither can the Finder Info structure associated with each HFS Plus catalog entry, or some extended attributes added by Apple and third-party developers. That’s because Apple told developers to name their attributes with reverse domain name syntax, as in “com.apple.filesystem.attribute”. Any such attribute names longer than 50 characters blow away the microzap rules and require allocating a fatzap object for the file.

    Each and every fatzap object begins with a 128KB header.
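    The rule that trips up Mac metadata can be sketched like this. The 50-character name limit and the single-64-bit-value restriction are the microzap constraints described above; the attribute names are just examples:

    ```python
    # Sketch of the microzap-vs-fatzap decision described above.
    # A microzap entry holds only a short name and one 64-bit value;
    # anything larger forces the ZAP object into fatzap form, which
    # starts with a 128KB header.

    MICROZAP_MAX_NAME = 50        # bytes, per the limit cited above

    def needs_fatzap(name: str, value: bytes) -> bool:
        return len(name) >= MICROZAP_MAX_NAME or len(value) > 8

    # A resource fork's value can't fit in 64 bits, so it needs a fatzap:
    print(needs_fatzap("com.apple.ResourceFork", b"\x00" * 4096))   # True

    # A four-byte type code fits in a short name and a 64-bit value:
    print(needs_fatzap("type", b"TEXT"))                            # False
    ```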

  • ZFS is “actually used by someone besides Apple,” something that starts a “virtuous circle” of more use and more support.

    This is a thinly-disguised version of the same thing command-line partisans have been spewing for two decades: things Apple invents must be bad, but open-source solutions to the same problems must be good. People who use command-line-based operating systems have been trying to kill HFS and its descendants since the day they appeared because they use extended metadata and other attributes that are hard to access from the command line. We’re now stuck with incredibly stupid, unwieldy filename extensions that describe a file’s data type in its filename because to do otherwise would prevent Unix fans from typing things like "*.jpg" and instead make them conform to the world that non-Unix people live in.

    In the comments to his post, Thaler repeats this error when he opines that case-insensitivity is bad for a file system because “it slows down performance, drags huge text-encoding tables into the kernel, creates heinous and subtle encoding problems, and reinforces non-portable bad habits.”

    Well, boo-frickin-hoo. The point of the operating system is to make the computer easier for the customer to use, not for the programmer to maintain. Writing easy code and pushing the learning curve onto the user is how we got command-line systems in the first place. The very notion that people should be forced to learn case-sensitive rules on the Mac after more than two decades of case-insensitivity—really, that making life a little easier for three programmers is worth inconveniencing 25,000,000 users forever—is exactly why people don’t use those tools, no matter how loudly their implementors scream about what uneducated idiots everyone else is for not doing things the hard way.

    Thaler contends that case-insensitivity could be enforced in the human interface in the “Save” dialog boxes and in the Finder, “which are just about the only two places you actually need it.” And in the custom Adobe file boxes. And in Path Finder. And File Buddy. And in Terminal, unless you’re going to allow people to create case-sensitive filenames that duplicate case-insensitive ones that they then couldn’t access in any other way. And in AppleScript, and in URLs, and everywhere you might actually type a pathname or filename. Thaler’s idea of “correct” operating system design is to take the code to make sure that “File.txt” doesn’t exist in a directory before you write “file.txt” to that directory and move it into dozens, if not hundreds of applications, to prevent the operating system from having to do it once. This kind of push-the-problem-onto-others thinking is exactly what’s wrong with so much of modern OS design.
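    The duplicated check that Thaler’s design implies looks roughly like this in every one of those applications; str.casefold() stands in here for HFS Plus’s Unicode case-folding tables:

    ```python
    # What every Save dialog, file manager, and script would have to
    # repeat if the filesystem stopped enforcing case-insensitivity.
    # str.casefold() is a stand-in for HFS Plus's Unicode folding tables.
    import os

    def case_conflict(directory: str, new_name: str):
        """Return an existing name differing only by case, if any."""
        folded = new_name.casefold()
        for existing in os.listdir(directory):
            if existing != new_name and existing.casefold() == folded:
                return existing
        return None
    ```

    HFS Plus performs this check exactly once, inside the file system, instead of in every program that ever writes a file.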

    Besides, HFS Plus is open-source as well as part of the Darwin project, and has been for five years. The “virtuous circle” could start just as easily by other people adopting it in their operating systems, but ZFS partisans don’t want to soil their machines with that. Maybe it uses too little disk space for them to take it seriously.

Thaler says, “But really, seriously, dude. The kid is cool. Don’t be like that. Don’t be a ZFS hater.” No one who reads our analysis in MWJ 2007.06.11 could possibly accuse us of hating ZFS. It’s a great file system for what it does. HFS Plus could really benefit from sparse files (although Thaler calling that a “modern filesystem concept” when it was present in Apple SOS/ProDOS in 1983 is a bit funny), copy-on-write, I/O sorting, and other stuff. Multiple prefetch streams would kind of require Mac OS X to get rid of the kernel funnel that prohibits more than one I/O transaction at a time, but that’s bound to happen sooner or later anyway.

Despite the fact that Sun only recently got any OS to boot from ZFS, it’s a fine file system for Solaris, and for big servers, and for big disks. It is not, in any significant way, suitable for today’s Macintosh systems, and punishing people today for how cheap hardware may be on their next computer is just stupid. As MWJ said:

Again, [MWJ] is not suggesting that these problems [making ZFS unsuitable for a sole Mac OS X filesystem] can’t be fixed—merely that it would be extremely expensive to deliver something that doesn’t offer current Mac OS X users much advantage, but creates lots of headaches. Would you like to be the Apple Genius who has to explain why a customer’s new US$129 operating system upgrade doesn’t let him recover any disk space when he deletes files? And that it’s supposed to work that way? ZFS would make a great addition for Xserve RAID and Mac OS X Server, and there’s little reason it shouldn’t be in Mac OS X itself for the discophiles who can’t wait to use it—but there’s even less reason why it should be the default file system for anything smaller than four drives and 2TB.

Don’t be a ZFS hater? Don’t be a Mac OS X hater, trying to punish people with tons of complications and worse performance on most Macs for advantages they won’t possibly notice for years. When the file system makes things harder yet provides most users with no benefits they can’t get at much lower cost today, it’s a bad idea. No matter how many geeks love it.