Let’s wrap up this ZFS thing, shall we?

Drew Thaler has responded to the response about ZFS, and now that we’re having an honest-to-goodness discussion about trade-offs, there’s a lot more that Thaler and MWJ agree on than you might have imagined. We agree with Thaler on the five data points he lists as, well, points of agreement.

Yet although Thaler doesn’t like the point-by-point quote and response, it’s the clearest way to deal with misconceptions and misrepresentations, so we’ll stick with it for a little while longer. Let’s clarify things even further, shall we?

It would be an absolutely terrible idea to take people’s perfectly working HFS+ installations on existing computers and forcibly convert them to ZFS, chuckling evilly all the while. Not quite sure where that strawman came from.

It came from where this entire ZFS storyline came from—the reports in June that Apple was about to make ZFS the “default file system” in Leopard. That is, the reports claimed that if you clicked “Erase and Install,” you’d wind up with a ZFS storage pool, that you’d be missing out on Leopard benefits unless you “switched” to ZFS, and so on.

This was not exactly a hidden story, then or now. The argument from MDJ and MWJ has never been that ZFS support is a bad idea—just that ZFS is a poor fit as a primary file system for today’s (10.4) and tomorrow’s (10.5) Mac OS X, for lots of reasons that were detailed in MWJ. Thaler obviously knew about it, because he called it a “weird and obviously fake rumor.” Lots of people didn’t find it so obviously fake. That’s why MWJ debunked it.

ZFS would be awfully nice for a small segment of the Mac OS X user base if it were ready today.

Absolutely, 100% true. Again, the argument is not that Mac OS X should not support ZFS, it’s that ZFS is a bad fit as Mac OS X’s default file system, which was, after all, the story. Every report since then, from AppleInsider’s report on a leaked v1.1 read-write preview for developers to today’s analyst report, is in some way predicated on the concept that Mac OS X will not simply support ZFS, but rely upon it as primary storage in the very near future.

We think that unlikely for all the reasons stated.

ZFS — or something with all the features of ZFS — will be more than nice, it will be necessary for tomorrow’s Macintosh computers.

Uh…we’d rather state our point of agreement as “Tomorrow’s Macintosh computers will need improved file systems designed for much larger storage capacities and increased reliability.” ZFS is a good candidate for that, but it didn’t exist six years ago, and the best choice six years from now may not exist today. If this is what Thaler means by “something with all the features of ZFS,” we might agree. We’re not sure it needs all the features of ZFS.

Then again, ZFS does not have all the features of HFS Plus (larger in-catalog file types, faster non-pathname access, and more efficient storage, for example).

Look, let’s be honest: most file systems are tailored for the OS architecture where they debuted. HFS and HFS Plus have lots of 32-bit quantities, even back when most processors were 16 bit, because the Mac processor architectures always had 32-bit registers. ZFS uses 64-bit and 128-bit quantities because it was designed for Solaris. While it’s extensible to “arbitrary” metadata, ZFS puts the data that Solaris needs directly in the catalog records or, at most, one level away so it can get to it very fast. The stuff that other operating systems might need gets shoved farther away.

Every engineering effort is about trade-offs. Many of the trade-offs in ZFS favor Solaris, just like many of the trade-offs in HFS Plus favor traditional Mac programming needs. It’s ridiculous to pretend otherwise, or to pretend that these trade-offs don’t make a difference for their respective operating systems.

Still think end-to-end data integrity isn’t worth it?

Wow, talk about strawmen—we never said end-to-end data integrity wasn’t “worth it,” whatever “it” may be. For some people, it very well may be. The error rate Thaler quotes is an error of one bit out of every 100,000,000,000,000 bits read. One out of every one hundred trillion bits, or as he calculates, one and a half errant bytes in a read of 150GB.

Is that “bad”? Sure—errors are always bad. How bad is it? That varies wildly. It might be a single bit error in an MP3 file, producing an audio blip that you’d never notice if you heard it a billion times. Or it might be a permission bit on the root of your filesystem, granting write permission it shouldn’t, which would be very bad indeed. If you’re recording live video in the field on a MacBook Pro using lossy compression, is avoiding an error rate of one byte every 100GB worth adding CPU time (and reducing battery life) to checksum every block?

We don’t know. We do, however, reject the notion that everyone would be willing to make that compromise with today’s hardware. We believe those who are should have the option. But, again, remember that much of the meta-discussion here is about ZFS becoming the default file system, as in “the file system on your MacBook’s internal hard drive by default.” That’s far less of a “choice” than using ZFS on an external volume.
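
For readers wondering what “checksum every block” actually involves, here’s a deliberately simplified sketch in Python rather than kernel code. The function names and the in-memory dictionaries are ours, purely for illustration; real ZFS uses fletcher2, fletcher4, or SHA-256 and stores each checksum in the parent block pointer rather than next to the data, which is what makes the check “end to end.”

    import hashlib

    def write_block(storage, checksums, addr, data):
        # Store the block, and keep a checksum of its contents apart from the data
        # (ZFS keeps it in the parent block pointer; a separate dict stands in here).
        storage[addr] = data
        checksums[addr] = hashlib.sha256(data).digest()

    def read_block(storage, checksums, addr):
        data = storage[addr]
        if hashlib.sha256(data).digest() != checksums[addr]:
            # With a mirror, RAID-Z, or ditto copies, ZFS could repair the block here;
            # on a single plain drive it can only report that the data is bad.
            raise IOError("checksum mismatch in block %r" % (addr,))
        return data

Every one of those hash computations is CPU time spent on every read and every write, which is exactly the trade-off in question.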

Apple using ZFS rather than writing their own is a smart choice.

Re-inventing the wheel is usually a waste of time. Yet time and again throughout the decade of transition to Mac OS X, we’ve seen Apple discard technologies that work well for Mac users and programmers, and replace them with technologies designed for different programming conventions (e.g., filename extensions instead of true file types, tons of tiny files instead of resources, inflexible and fragile paths instead of flexible aliases, and so on).

The repeated, loud, insistent message from the open-source community is “stop doing what’s right for your platform and do it our way or we will scream and scream and scream until you do it our way.” It’s the entire command-line philosophy: make it easier on the programmers by making it harder on the users, and compensate by telling the users how stupid they are.

We decline to participate in this delusion.

The choices at this point are essentially twofold: (1) start completely from scratch, or (2) use ZFS. There’s really no point in starting over. ZFS has a usable license and has been under development for at least five years by now. By the time you started over and burned five years on catching up it would be too late.

See? Thaler creates another strawman by making our exact previous point, asserting there are only two choices: starting completely over, or using ZFS as it is. HFS Plus can’t be improved (even though its very existence shows that HFS could be improved, something that not many people saw coming a decade ago), ZFS ideas can’t be integrated into HFS Plus, nor can Apple invent yet another file system despite having outstanding file system talent ranging from the HFS Plus creators (though Deric is not in engineering anymore) to Dominic Giampaolo, author of the Be File System. No, you’re told, it’s either the exact open-source solution or something never before seen.

ZFS is apparently so marvelous that, now that it’s been created, no one has the choice to eclipse it. Wow. As Thaler’s next paragraphs go on to say, in essence, ZFS is cool and HFS is not, and that’s apparently the end of any serious discussion. And that’s why so many people believe so many false things about ZFS. The entire discussion of filesystems is skewed, from the start, towards the default position that Macintosh-related filesystem constructs are somehow bad.

For example?

Another likely source of fatzaps in ZFS on Mac OS X is the resource fork. But with Classic gone, new Macs ship with virtually no resource forks on disk. There are none in the BSD subsystem. There are a handful in /System and /Library, mostly fonts. The biggest culprits are large old applications like Quicken and Microsoft Office. A quick measurement on my heavily-used one-year-old laptop shows that I have exactly 1877 resource forks out of 722210 files — that’s 0.2%, not 20%.

(Fun fact: The space that would be consumed by fatzap headers for these resource files comes out to just 235 MiB, or roughly six and a half Keyboard Software Updates. Again: not nothing, but hardly a crisis to scream about.)

And if it were wise to make decisions for 25,000,000 users based on Thaler’s laptop, that would be great. Our Power Macintosh G5 production system has non-empty resource forks on 76,060 files, or about 4.1% of the total number of files. They occupy about 2.1GB on disk using 4KB allocation blocks. Using a fatzap allocates another 128KB for each of those forks just for ZFS overhead, or an additional 9.3GB to store no additional information.

This system’s internal hard drive has 42GB of free space. “Switching” it to ZFS would eliminate 22% of our remaining free space to store no additional information. Thaler describes this as “negligible,” and “not nothing, but hardly a significant problem.” Your mileage may vary.
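
For anyone who wants to check that arithmetic, here it is as a few lines of Python; the file count and free space are from our drive, and the 128KB figure is the fatzap overhead discussed above.

    files_with_forks = 76060           # files with resource forks on our production drive
    fatzap_bytes = 128 * 1024          # one fatzap header per affected file

    overhead_gb = files_with_forks * fatzap_bytes / (1024.0 ** 3)
    free_space_gb = 42.0

    print(round(overhead_gb, 1))                        # about 9.3GB of pure bookkeeping
    print(round(100 * overhead_gb / free_space_gb))     # about 22% of the free space left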

Seriously, folks—Solaris doesn’t use extended attributes much, so unless they’re tiny, each file carrying them requires 128KB of fatzap overhead. Thaler says:

Classic HFS attributes (FinderInfo, ExtendedFinderInfo, etc) are largely unnecessary and unused today because the Finder uses .DS_Store files instead. In the few cases where these attributes are set and used by legacy code, they should fit easily in a small number of microzaps.

Page 37 of the ZFS On-Disk Specification says that microzap objects can hold attributes only if all of the names are less than or equal to 50 characters (including the NULL terminator byte) and all of the values are 64-bit values. In essence, Thaler is saying that the Mac OS X implementation of ZFS would need to remap all accesses of FinderInfo and its siblings to individual name-value extended attributes stored two or more levels of indirection away from the catalog entry, even though they need to be read on just about every directory access.
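
Here’s that rule as a sketch, the way we read page 37; the function and the attribute name are ours for illustration, not anything from Apple’s port.

    def fits_in_microzap(name, value):
        # Page 37's test, roughly: the name (plus its NUL terminator) must fit in
        # 50 bytes, and the value must be a single 64-bit integer.
        name_ok = len(name.encode("utf-8")) + 1 <= 50
        value_ok = isinstance(value, int) and 0 <= value < 2 ** 64
        return name_ok and value_ok

    # FinderInfo is 32 bytes of packed fields, not a 64-bit integer, so by this
    # reading it can't live in a microzap as-is:
    print(fits_in_microzap("com.apple.FinderInfo", b"\x00" * 32))   # False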

Possible? Sure. Inefficient? Hard to say—with aggressive caching and fast hard drives, the OS might be able to mitigate the change, even though it would plainly require reading at least three sectors from disk for every file instead of the one sector most entries take now. Fragmentation might actually become an issue. We do know that you can’t simply assume there won’t be performance problems. Would they be worth the trade-offs? Probably, to some users today, and to more users tomorrow. But not to everyone, and certainly not to everyone today.

There’s good news in this, though: the 128KB fatzap object can point to all of the attributes for a given object. If a given file has a resource fork and 200 extended attributes, it only needs one fatzap object, not 201 fatzap objects. We found only 300 files with named extended attributes on our production system’s drive, so if any of those also had resource forks (and it looks like they didn’t), each would only need one fatzap object to point to its extended attributes and a resource fork, which would itself actually be an “extended attribute” in ZFS terminology. Alas, none of the files had an extended attribute whose value was 64 bits or less, so they’d all require a fatzap.
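
To make the “one fatzap, not 201” point concrete, here’s a tiny sketch building on the fits_in_microzap check above; again, the names are ours.

    def fatzaps_needed(attributes):
        # attributes: dict of name -> value for one file; a resource fork counts
        # as just another (large, non-integer) attribute in ZFS terms.
        needs_one = any(not fits_in_microzap(name, value)
                        for name, value in attributes.items())
        return 1 if needs_one else 0

    # A file with a resource fork and 200 other attributes still costs one fatzap:
    attrs = {"resource-fork": b"x" * 50000}
    attrs.update({"attr%d" % i: b"not a 64-bit int" for i in range(200)})
    print(fatzaps_needed(attrs))   # 1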

ZFS snapshots don’t have to be wasteful.

No, of course they don’t. MWJ even pointed out that ZFS snapshots would be a far more efficient backup mechanism than Time Machine in Leopard, at least for large files.

That said, Thaler’s description of separating static and transient data is pretty much a pipe dream, unless you’re willing to change how you work to suit the computer (instead of the other way around).

Of course your “/Applications” folder is mostly static. So are most of the Unix folders like “/usr/bin”, your Fonts folders, and so on. ZFS snapshots affect an entire filesystem, so Thaler says the trick is to separate your storage into however many filesystems you need, all mounted at different places in the “/” hierarchy, so they can have different ZFS features:

Once the transient data is out of the picture, our snapshots will consist of 95% or more static data — which is not copied in any way — and a tiny percentage of dynamic data. And remember, the dynamic data is not even copied unless and until it changes. The net effect is very similar to doing an incremental backup of exactly and only the files you are working on. This is essentially a perfect local backup: no duplication except where it’s actually needed.

By transient, though, Thaler means what most Mac OS X users refer to as “temporary”—caches, temporary files, stuff that you can recreate or expect to be erased upon restart. If you shunt all that stuff off to a filesystem that has no snapshots, then creating a snapshot of what’s left would be more efficient. That’s true, and we grant that.

Snapshots are also an efficient way to store multiple backups. Think of a large E-mail database. During the course of the day, you’re likely going to receive lots of E-mail, and each new message changes the database. Time Machine would currently back up that entire database (let’s say it’s 700MB) every hour, because it changed every hour. A snapshot would only include the actual data blocks changed as you received E-mail. If you only received 250KB of E-mail during the day, the snapshot would only record 250KB worth of changes.
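
A back-of-the-envelope model of that example, with the block size and counts made up purely for illustration:

    BLOCK = 128 * 1024                          # one 128KB record

    db_blocks = (700 * 1024 * 1024) // BLOCK    # ~5,600 blocks in the 700MB database
    changed = (250 * 1024 + BLOCK - 1) // BLOCK # 250KB of new mail touches ~2 records

    snapshot_cost = changed * BLOCK             # old copies of only the rewritten blocks
    full_copy_cost = db_blocks * BLOCK          # what copying the whole file costs

    print(snapshot_cost // 1024, "KB vs", full_copy_cost // (1024 * 1024), "MB")

Roughly 256KB pinned by the snapshot versus 700MB copied every hour is the whole argument for copy-on-write in a nutshell.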

Even excluding transient files, though, you’re talking about a lot of data. Backing up full files, we currently back up about 6GB per day from our production system, counting just the files that changed since the previous night and excluding the “transient” folders. If ZFS eliminated 90% of that as unchanging, it would still be about 600MB per day (roughly one CD-ROM), and that’s for only one copy. Thaler speaks of 12 hourly snapshots, seven daily snapshots, and four weekly ones. That’s going to take a fair amount of disk space in your storage pool, because that’s how snapshots work. It’s not like you can plug in an external drive, take a snapshot “to” that drive, and unplug it.

Thaler is right that if data doesn’t change, it doesn’t take up space in a snapshot—but then again, it doesn’t take up space in Time Machine, either. Except in the initial backup, of course, and to be safe you still need a full backup like that even with a ZFS storage pool, especially one that’s not mirrored. If a power surge or other non-software glitch kills your drive, it’s just as dead with or without checksums. Extra copies of the data on that same drive won’t really help you. (Note, of course, that Time Machine will also let you make this mistake, though it’s not quite as easy.)

Absent the full description of the copy-on-write mechanism used by snapshots in MWJ 2007.06.11, some people have misinterpreted our comments about snapshots not releasing disk space. As long as a data block is used by any snapshot, it cannot be reclaimed. So, for example, let’s assume that a new E-mail message changes one of the blocks in your E-mail database, on a filesystem that has active snapshots. The data block that just got replaced remains allocated, but it is now allocated to the snapshot and not to the database file itself.

That’s really a very cool way to do it. But, you must be aware that until you kill all of the snapshots that reference that block, it remains allocated on disk. If you delete your entire database file to “save disk space” and switch to a different E-mail program, you’re not saving any disk space, because the snapshots still have all of the blocks you deleted, and retain them until you destroy the snapshots.
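
Here’s the same point as a toy model; it’s nothing like ZFS’s real allocator, but it shows why deleting the file reclaims nothing while a snapshot still references its blocks.

    live = {"mail-db": {1, 2, 3, 4}}       # live file -> the block numbers it occupies
    snapshots = [{1, 2, 3, 4}]             # a snapshot taken before the delete

    def pinned_blocks():
        # A block stays allocated while anything, live or snapshot, still points at it.
        refs = set()
        for blocks in list(live.values()) + snapshots:
            refs |= blocks
        return refs

    del live["mail-db"]                    # "save disk space" by deleting the database
    print(pinned_blocks())                 # {1, 2, 3, 4}: nothing was reclaimed

    snapshots[:] = []                      # only destroying the snapshot...
    print(pinned_blocks())                 # set(): ...returns the blocks to the free pool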

Since snapshots describe an entire filesystem, you get the choice of using them to back up everything or nothing. If you’d rather only back up “/Applications” once per week, but “~/Documents” once per hour, you have to have them in separate file systems, mounted at the appropriate places in the hierarchy. Thaler says he doesn’t expect users to set this up, but that Apple could. He’s right — but if you want to change it, welcome to (at present) the world of ZFS command-line arcana.
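
To give a flavor of that arcana, here is a hypothetical hourly job, sketched in Python around the stock zfs command-line tool. The pool and dataset names and the retention count are invented for the example, and the whole thing assumes Apple’s port exposes the same zfs(1M) subcommands Solaris does.

    import subprocess, datetime

    DATASET = "tank/documents"         # hypothetical dataset mounted at ~/Documents
    KEEP = 12                          # keep the last 12 hourly snapshots

    def zfs(*args):
        result = subprocess.run(["zfs"] + list(args), check=True,
                                capture_output=True, text=True)
        return result.stdout

    stamp = datetime.datetime.now().strftime("hourly-%Y%m%d-%H%M")
    zfs("snapshot", DATASET + "@" + stamp)

    snaps = [line for line in zfs("list", "-H", "-t", "snapshot", "-o", "name").splitlines()
             if line.startswith(DATASET + "@hourly-")]
    for old in sorted(snaps)[:-KEEP]:
        zfs("destroy", old)

Point something like that at a different dataset with a different retention count and you have the “/Applications weekly, ~/Documents hourly” split, but nobody should pretend ordinary users will set it up by hand.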

Also, note that since the way to access a snapshot is to mount it (as a clone), finding a backed-up file means mounting the snapshot as a copy of your filesystem, digging into it to find the file, and copying it back to the live filesystem (or, if you want and can keep the paths straight, reading it directly from the snapshot). A good human interface can simplify this, of course, but that’s where it stands today.

So, yeah, there’s a lot of great stuff about snapshots. They don’t solve the world’s ills, and a lot of the push for ZFS pretty much assumes they do. Look at these entries in Thaler’s comments section:

The MWJ rebuttal claims “without RAID-Z, ZFS can only tell you that the data is bad.” This is not true.

ZFS has significant self-healing capabilities even when used on a single disk. Specifically, the filesystem’s uberblock and all metadata blocks are replicated. ZFS also allows file data to be replicated via ditto blocks. While it is possible that every copy of an block could be corrupted, this is extremely unlikely.

Well, yeah. All file systems allow blocks to be replicated. This is called “backing up.” Note that the words “ditto” or “replicate” do not appear anywhere in the ZFS On-Disk Specification, either, so it’s not like these are features automatically provided by every ZFS implementation. They could be, but if they were, wouldn’t the disk spec include descriptions of those replicated blocks so all implementations would do it the same way? (We may be missing it, but we’ve read the spec a few times now, and searched for keywords like “copy,” and we’re not finding it.)

Plus, the commenter doesn’t seem to realize that HFS Plus also keeps duplicate copies of key filesystem data in extra blocks. Not as much as he says ZFS does, but some. And if you’re arguing that ZFS replicates every block, then that’s indistinguishable from “mirroring” in any significant respect, including taking twice the hard disk space.

It’s great to have such options, but you don’t get to argue that ZFS really won’t eat a lot of disk space and then argue that keeping multiple copies of every block (and using a lot of disk space) answers the other objections. It’s all trade-offs, but you don’t get that from ZFS advocates.

I was really struck that MacJournal’s article sounded a lot like what everyone was saying when Apple was switching to NeXT’s OS: a lot of scare-mongering about how it’s the wrong fit, isn’t needed, and how what we’re used to now will be good enough in the future. ZFS is amazing, and if Apple can put a decent GUI on it we’d be fools to not want it. I can’t wait for the linux community to get their act together and replace ext3/4 with it for my servers.

In other words, “HFS Plus is old, ZFS is cool. Let’s tear everything up and replace it with ZFS! You’d be stupid to not want this open-source marvel.” Where, exactly, were those strawmen again?

1) replacing drives in pools: zpool replace and sliver/unsliver should do most of what they want.

We didn’t say replacing drives would be difficult. We said removing drives would be difficult, because it is, and we’re saying it because the ZFS advocates won’t. Look, for example, at this bit of “analysis” from Blackfriars’ Carl Howe in his ZFS lovefest today:

Want to add more storage? Just add another disk to the pool, and ZFS knows what to do. Want to replace a disk? Tell ZFS to remove it from the pool, and it clears it off for you.

OK, so suppose you have a computer with a 500GB internal hard drive and a 300GB external hard drive, and you’re using 700GB of storage space in a single pool across the two drives. If you truly want to replace the 300GB drive with a new one, you can—just connect the new drive, add it to the storage pool, and tell ZFS to clear stuff off the old one.

If you want to remove the external drive, well, you can’t. It’s not mirrored, and not even ZFS is cool enough to synthesize 200GB of storage space out of thin air to hold the data that would go missing if you removed the external drive without providing at least 200GB of storage space to replace it. Once you’ve added a drive to a ZFS storage pool, you’re stuck with it. If you have enough space on all drives to store what you’d remove from one drive, you can ask ZFS to replace or sliver/unsliver to free up the drive you want to remove. If you don’t, you’ve got to add more drives to be able to remove existing ones or your volume is damaged.
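
The space constraint, modeled crudely in Python; the function name and the drive labels are ours, purely to illustrate why the 300GB drive in the scenario above can’t simply be pulled.

    def can_evacuate(drive_to_remove, pool_drives, used_gb):
        # Evacuating a drive only works if everything in use fits on what's left.
        remaining = sum(size for name, size in pool_drives if name != drive_to_remove)
        return used_gb <= remaining

    pool = [("internal", 500), ("external", 300)]            # capacities in GB
    print(can_evacuate("external", pool, used_gb=700))       # False: 700GB > 500GB
    print(can_evacuate("external", pool, used_gb=400))       # True: the rest could hold it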

You can try telling us that people not intimately versed in filesystems won’t read that as “add or remove drives any time you like and it all just magically works,” but we’re not buying it. Adding magically works. Removing does not. This also means, by the way, that if your external drive fails, your entire system is compromised, including all of the on-disk snapshots on that drive. Not quite what most people wanting “next-century” storage had in mind, we’d bet. And there’s nothing wrong with that. That’s an entirely reasonable design trade-off for ZFS — but you never hear about it from ZFS advocates, who want to pretend that their file system is magic and infallible, and everything else is old and icky. It just doesn’t work that way.

snapshot size issues are identical with Apple’s TimeMachine, but there they are on a per-file basis, not a per-block basis.

In which case they’re not identical. (Time Machine snapshots are larger.)

Performance concerns are irrelevant.

Well, that resolves that.

The other thing that is rarely mentioned is that snapshots via copy-on-write are MUCH more efficient than the file-based snapshots of Time Machine for large files.

Well, we mentioned it: in MWJ 2007.06.11, in the first response to Thaler, and again here. It’s one of those design trade-offs. It’s not like ZFS has no advantages, it’s just that it’s not the solution to every problem for which it’s proffered.

Every time you boot your Windows or Linux VM, the whole virtual disk (Vista’s minimum is over 10GB) will be changed. With Time Machine, my understanding is all of that data will be backed up, even though 99% of the drive file is static. Under ZFS, only those few changed blocks will be backed up. Much more efficient and lightweight.

Yes, as we said. Though in this example, if your virtual disk is accessible to the OS as a mountable volume, Time Machine would only back up the files on the virtual volume that changed, at least if properly configured. If not, then yes, it would want to copy the entire 10GB virtual volume each time.

Can someone clue me in on who MWJ is or why anything they say matters to anyone? I think I was sick that day.

We’ve been ill much of the past year, so we’re sympathetic.

See here for information on MWJ, which has been publishing ad-free Macintosh analysis and commentary, relied upon by professionals around the world, for more than ten years. MWJ’s analysis has also been reprinted in Macworld, TidBITS, and MacAddict, among other places. MDJ and MWJ staffers have presented at WWDC as far back as 1990. You can check out free sample issues going back a decade, including a complete description of HFS Plus and why it evolved the way it did. There’s lots of free stuff on the site; please read and enjoy it.

In fact, we’ve taken this discussion public only in the hopes of getting clear information out there—not that ZFS is “bad” or “wrong,” but that people need to be honest about the trade-offs involved both in its design and in what it would mean for Mac OS X users and applications. If you like what you’ve read, sign up for the “free trial” of MDJ or MWJ and see more of it as we can get it done. But those commitments require that this be the end of this discussion for a while. Thanks for your understanding.