
Fast commits for ext4

January 15, 2021

This article was contributed by Marta Rybczyńska

The Linux 5.10 release included a change that is expected to significantly increase the performance of the ext4 filesystem; it goes by the name "fast commits" and introduces a new, lighter-weight journaling method. Let us look into how the feature works, who can benefit from it, and when its use may be appropriate.

Ext4 is a journaling filesystem, designed to ensure that filesystem structures appear consistent on disk at all times. A single filesystem operation (from the user's point of view) may require multiple changes in the filesystem, which will only be coherent after all of those changes are present on the disk. If a power failure or a system crash happens in the middle of those operations, corruption of the data and filesystem structure (including unrelated files) is possible. Journaling prevents corruption by maintaining a log of transactions in a separate journal on disk. In case of a power failure, the recovery procedure can replay the journal and restore the filesystem to a consistent state.

The ext4 journal includes the metadata changes associated with an operation, but not necessarily the related data changes. Mount options can be used to select one of three journaling modes, as described in the ext4 kernel documentation. data=ordered, the default, causes ext4 to write all data before committing the associated metadata to the journal. It does not put the data itself into the journal. The data=journal option, instead, causes all data to be written to the journal before it is put into the main filesystem; as a side effect, it disables delayed allocation and direct-I/O support. Finally, data=writeback relaxes the constraints, allowing data to be written to the filesystem after the metadata has been committed to the journal.
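
As a purely illustrative aside, the journaling mode is just a mount option; a minimal sketch that selects it with the mount(2) system call directly (the device and mount point are made up) might look like:

    #include <stdio.h>
    #include <sys/mount.h>

    int main(void)
    {
        /* Illustrative device and mount point; the last argument carries the
           filesystem-specific options, here selecting full data journaling. */
        if (mount("/dev/sdb1", "/mnt/data", "ext4", 0, "data=journal") != 0) {
            perror("mount");
            return 1;
        }
        return 0;
    }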

Another important ext4 feature is delayed allocation, where the filesystem defers the allocation of blocks on disk for data written by applications until that data is actually written to disk. The idea is to wait until the application finishes its operations on the file, then allocate the actual number of data blocks needed on the disk at once. This optimization limits unneeded operations related to short-lived, small files, batches large writes, and helps ensure that data space is allocated contiguously. On the other hand, the writing of data to disk might be delayed (with the default settings) by a minute or so. In the default data=ordered mode, where the journal entry is written only after flushing all pending data, delayed allocation might thus add more delay between the actual change and the writing of the journal. To ensure that data is actually written to disk, applications use the fsync() or fdatasync() system calls, causing the data (and the journal) to be written immediately.
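
As a purely illustrative example of that pattern, a program that wants a record to survive a crash writes it and then forces it out with fsync() on the same file descriptor (the file name is made up):

    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        const char buf[] = "important record\n";
        /* Illustrative file name */
        int fd = open("journal.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
        if (fd < 0) { perror("open"); return EXIT_FAILURE; }

        if (write(fd, buf, strlen(buf)) != (ssize_t)strlen(buf)) {
            perror("write");   /* data may still be only in the page cache */
            return EXIT_FAILURE;
        }

        /* Without this call, delayed allocation means the blocks may not be
           written (or even allocated) for up to a minute or so. */
        if (fsync(fd) != 0) { perror("fsync"); return EXIT_FAILURE; }

        if (close(fd) != 0) { perror("close"); return EXIT_FAILURE; }
        return EXIT_SUCCESS;
    }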

Ext4 journal optimization

One might assume that, in such a situation, there are a number of optimizations that could be made in the commit path; that assumption turns out to be correct. In this USENIX'17 paper [PDF], Daejun Park and Dongkun Shin showed that the current ext4 journaling scheme can introduce significant latencies because fsync() causes a lot of unrelated I/O. They proposed a faster scheme that takes into account the fact that some of the metadata written to the journal could instead be derived from changes to the inode being written, and that it is possible to commit only the transactions related to the requested file descriptor. Their optimization works in the data=ordered mode.

The fast-commit changes, implemented by Harshad Shirwadkar, are based on the work of Park and Shin. This work implements an additional journal for fast commits, but simplifies the commit path. There are now two journals in the filesystem: the fast-commit journal for operations that can be optimized, and the regular journal for "standard commits" whose handling is unchanged. The fast-commit journal contains operations executed since the last standard commit.

Ext4 uses a generic journaling layer called "Journaling Block Device 2" (JBD2), with the exact on-disk format documented in the ext4 wiki. JBD2 operates on blocks, so when it commits a transaction, this transaction includes all changed blocks. One logical change may affect multiple blocks, for example the inode table and the block bitmap.

The fast-commit journal, on the other hand, contains changes at the file level, resulting in a more compact format. Information that can be recreated is left out, as described in the patch posting:

For example, if a new extent is added to an inode, then corresponding updates to the inode table, the block bitmap, the group descriptor and the superblock can be derived based on just the extent information and the corresponding inode information.

During recovery from this journal, the filesystem must recalculate all changed blocks from the inode changes, and modify all affected data structures on the disk. This requires specific code paths for each file operation, and not all of them are implemented right now. The fast-commits feature currently supports unlinking and linking a directory entry, creating an inode and a directory entry, adding blocks to and removing blocks from an inode, and recording an inode that should be replayed.

Fast commits are an addition to — not a replacement of — the standard commit path; the two work together. If fast commits cannot handle an operation, the filesystem falls back to the standard commit path. This happens, for example, for changes to extended attributes. During recovery, JBD2 first performs replay of the standard transactions, then lets the filesystem recover fast commits.

fsync() side effects

The fast-commit optimization is designed to work with applications that use fsync() frequently to ensure data integrity. When we look at the fsync() and fdatasync() man pages, we see that those system calls only guarantee to write data linked to the given file descriptor. With ext4, as a side effect of the filesystem structure, all pending data and metadata for all file descriptors will be flushed instead. This creates a lot of I/O traffic that is not needed to satisfy any given fsync() or fdatasync() call.

This side effect leads to a difference between the paper and the implementation: a fast commit may still include changes affecting other files. In a review, Jan Kara asked why unrelated changes are committed. Shirwadkar replied that, in an earlier version of the patch, he did indeed write only the file in question. However, this change broke some existing tests that depend on fsync() working as a global barrier, so he backed it out.

Ted Ts'o commented that the current version of the patch set keeps the existing behavior, but he can see workloads where "not requiring entanglement of unrelated file writes via fsync(2) could be a huge performance win." He added that a future solution could be a new system call taking an array of file descriptors to synchronize together. For now, application developers should base their code on the POSIX definition, and not rely on that specific fsync() side effect, as it might change in the future.

Using fast commits

Fast commits are activated at filesystem creation time, so users will have to recreate their filesystems to use this feature. In addition, the required support in e2fsprogs has not yet been added to the main branch, but is still in development. So interested users will need to compile the tool on their own, or wait until the feature is supported by their distribution. When enabled, information on fast commits shows up in a new /proc/fs/ext4/dev/fc_info file.

On the development side, there are numerous features to be added to fast commits. These include making the operations more fine-grained and supporting more cases that fall back to standard commits today. Shirwadkar is also working on byte-granularity (instead of the current block-granularity) fast-commit support for direct-access (DAX) mode, to be used on persistent-memory devices.

The benchmark results given by Shirwadkar in the posted patch set show 20-200% performance improvements with filesystem benchmarks for local filesystems, and 30-75% improvement for NFS workloads. We can assume that the performance gain will be larger in applications doing many fsync() operations than in those doing only a few. Either way, though, the fast-commits feature should lead to better ext4 filesystem performance going forward.




Fast commits for ext4

Posted Jan 15, 2021 18:49 UTC (Fri) by NYKevin (subscriber, #129325) [Link] (5 responses)

I have some thoughts about this:

1. As usual, Ted's reasoning seems eminently sensible to me. Apps that really want to sync the whole filesystem should *already* be using syncfs(2).
2. Still, I could imagine some apps not knowing about the "and also you have to fsync the directory" rule (see fsync(2), and the sketch after this list), which is currently not really enforced since fsync in practice flushes "everything." It might be worth special-casing that, or providing a flag. But hard links make this trickier (which directory do you want to fsync?).
3. It would be really nice if rename(2) would function as a write barrier, or at least not be reordered before any writes to the file that is renamed, but I'm not sure where current filesystems stand on that... I know this has definitely been discussed in the past, though (see for example this article: https://lwn.net/Articles/351422/).
4. Maybe if this gets performant enough, we can rename O_DSYNC to O_PONIES.
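
A minimal sketch of the rule from point 2: create and fsync() the file, then fsync() the containing directory so the new directory entry itself is durable (paths are illustrative, error handling simplified):

    #include <fcntl.h>
    #include <unistd.h>

    /* Make both a new file's contents and its directory entry durable.
       Paths are illustrative; error paths are simplified. */
    int create_durable(const char *path, const char *dirpath,
                       const void *data, size_t len)
    {
        int fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0644);
        if (fd < 0)
            return -1;
        if (write(fd, data, len) != (ssize_t)len || fsync(fd) != 0 || close(fd) != 0)
            return -1;

        /* Without this, the file's data may be on disk while the directory
           entry pointing at it is not. */
        int dirfd = open(dirpath, O_RDONLY | O_DIRECTORY);
        if (dirfd < 0)
            return -1;
        int ret = fsync(dirfd);
        close(dirfd);
        return ret;
    }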

Fast commits for ext4

Posted Jan 16, 2021 7:46 UTC (Sat) by khim (subscriber, #9252) [Link] (4 responses)

Hyrum's Law says that what Ted says is kind of irrelevant. If the implementation provided a full sync instead of what the documentation says, then it should continue to do so.

You may provide optional support for something else and then, over the course of many years, test more and more apps with it. Maybe, just maybe, it may become the new default, eventually. But that's not guaranteed…

Doing anything else is just asking for trouble, unfortunately.

Fast commits for ext4

Posted Jan 16, 2021 19:39 UTC (Sat) by NYKevin (subscriber, #129325) [Link] (3 responses)

I am generally in favor of Hyrum's law, but I do have my limits, and one of those limits is "Was anyone even relying on that in the first place?"

See for example the line in https://www.freebsd.org/cgi/man.cgi?query=fcntl&sekti... about the "completely stupid semantics of System V" (yes, they really put that in a man page). IIRC someone looked into it, and:

1. It had been standardized that way because that's what one particular implementation decided to do. There was never a proper rationale or anything.
2. Nobody could find any apps which actually rely on this behavior.
3. They were able to find at least one app which *accidentally* triggers this behavior and gets buggy file locking as a result.

If this wasn't already standardized in POSIX, I would be heavily in favor of Linux etc. just changing the behavior to something more sensible. Unfortunately, it's probably a Bad Idea to deliberately violate POSIX, even when POSIX is obviously silly, so we're likely stuck with it. But it would've been nice if someone would have caught this earlier, and I think that Hyrum's Law would not have been helpful in that situation.

Moving on to the actual topic of discussion here. Applications, in general, don't use fsync much at all. Many of them do the open/write/rename ritual, which technically requires an fsync (after the write and before the rename) to guarantee a safe ordering, but I don't think many people bother with that. I am of the opinion that Hyrum's Law ought to require the rename to function as a barrier in this case, as I wrote in my initial comment. I don't think many apps are using an fsync of /path/to/foo.txt in order to guarantee durability of /path/to/bar.txt, but we would need to do a survey to be sure. If I'm right about that, then I don't think we need to apply Hyrum's Law here as (a) few apps would be broken in practice, (b) there's already an interface (syncfs) to do what those apps want (so it'd be easy to fix), and (c) the performance wins for all of the apps which use fsync correctly could potentially be quite large. Also (d) we don't want to penalize apps written for other Unices for using the standard interface, so introducing a Linux-only "no really, just fsync this one file" syscall is a Bad Idea.

Fast commits for ext4

Posted Jan 18, 2021 19:24 UTC (Mon) by tytso (subscriber, #9993) [Link]

So for decades, competently written text editors write new precious files, such as source files via:

1) Write the new contents of foo.c to foo.c.new
2) Fsync foo.c.new --- check the error return from the fsync(2) as well as the close(2)
3) Delete foo.c.bak
4) Create a hard link from foo.c to foo.c.bak
5) Rename foo.c.new on top of foo.c

This doesn't require an fsync of the directory, but it guarantees that /path/to/foo.c will either have the original contents of foo.c, or the new contents of foo.c, even if there is a crash at any time during the above process. If you want portability to other POSIX operating systems, including people running, say, retro versions of BSD 4.3, this is what you should do. It's what emacs and vi do, and some of the "ritual", such as making sure you check the error return from close(2), is because otherwise you might lose data if you run into a quota overrun on the Andrew File System (the distributed file system developed at CMU, and used at MIT Project Athena, as well as several National Labs and financial institutions).
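
A minimal C sketch of those five steps (file contents and error handling abbreviated; it only illustrates the ordering described above):

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Sketch of the five-step "ritual" above; returns 0 on success. */
    int safe_replace(void)
    {
        static const char new_contents[] = "/* new version of foo.c */\n";

        /* 1) Write the new contents to foo.c.new */
        int fd = open("foo.c.new", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0)
            return -1;
        if (write(fd, new_contents, sizeof(new_contents) - 1)
                != (ssize_t)(sizeof(new_contents) - 1))
            return -1;

        /* 2) fsync it, checking the error returns of fsync(2) and close(2) */
        if (fsync(fd) != 0 || close(fd) != 0)
            return -1;

        /* 3) Delete foo.c.bak */
        unlink("foo.c.bak");

        /* 4) Create a hard link from foo.c to foo.c.bak */
        if (link("foo.c", "foo.c.bak") != 0)
            return -1;

        /* 5) Rename foo.c.new on top of foo.c */
        return rename("foo.c.new", "foo.c");
    }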

That being said, rename is not a write barrier, but as part of the O_PONIES discussion, on a close(2) of an open which was opened with O_TRUNC, or on a rename(2) where the destination file is getting overwritten, the file being closed, or the source file of the rename will have an immediate write-out initiated. It's not going to block the rename(2) operation from returning, but it narrows the race window from 30 seconds to however long it takes to accomplish the writeout, which is typically less than a second. It's also something that was implemented informally by all of the major file systems at the time of the O_PONIES controversy, but it doesn't necessarily account for what newer file systems (for example, like bcachefs and f2fs) might decide to do, and of course, this is not applicable for what other operating systems such as MacOS might be doing.

The compromise is something that was designed to minimize performance impact, since users and applications also depend upon performance --- and get cranky when there are performance regressions --- while still papering over most of the problems caused by careless applications. From the file system developers' perspective, the ultimate responsibility is on application writers if they think a particular file write is precious and must not be lost after a system or application crash. After all, the application may be doing something really stupid, such as overwriting a precious file by using open(2) with O_TRUNC because it's too much of a pain to copy over ACLs and extended attributes, so it's simpler to just use O_TRUNC, overwrite the data file, and cross your fingers. There is absolutely no way the file system can protect against application writer stupidity, but we can try to minimize the risk of damage, while not penalizing the performance of applications which are doing the right thing, and are writing, say, a scratch file.

Fast commits for ext4

Posted Jan 19, 2021 8:15 UTC (Tue) by LtWorf (subscriber, #124958) [Link] (1 responses)

What does the system call do in other filesystems?

If it's an ext4 quirk then any software relying on that would already break just by virtue of moving to a different filesystem.

Fast commits for ext4

Posted Jan 20, 2021 7:48 UTC (Wed) by viiru (subscriber, #53129) [Link]

> What does the system call do in other filesystems?

> If it's an ext4 quirk then any software relying on that would already break just by virtue of moving to a
> different filesystem.

In my understanding that is exactly what it is. Ext3 shares the behavior, but for example XFS does not. This caught many application developers by surprise, but that happened a couple of decades ago and has most likely been fixed in any sensible application.

What about other filesystems?

Posted Jan 15, 2021 19:57 UTC (Fri) by Wol (subscriber, #4433) [Link] (45 responses)

All this talk about "apps should use fsync" or "apps should use fsfsync" fills me with horror, as someone who wants to write a system that relies on data integrity. Are you telling me that my app needs to be filesystem-aware, and not only that but aware of what mount options were used, so I know which commands to call to make sure that my data is safe?

And no, I don't want to offload that onto the glibc guys either.

I know it'll be hard, but is there any way we can get the VFS to dictate that certain things are supposed to happen in a certain order. As an app, I don't care *how* the OS does it, but I want to be able to reason about what's hit the disk and when. And I DON'T want to have to worry about what the filesystem is or how it does it.

I've said it before, but I don't care too much about fsync or fsfsync, and I really don't want to have to worry about how the OS will react to those calls - one only has to go back to the ext3/ext4 transition/debacle where code that was fast on ext3 brought ext4 to its knees... if I can know for certain that stuff hits the disk in the order I write it, and tell the system what it can write in parallel and what it can't ... (basically data can be written in parallel, logs can be written in parallel, just not in parallel with each other. And the OS needs me to tell it the difference between the two, it won't know by itself.)

Cheers,
Wol

What about other filesystems?

Posted Jan 15, 2021 20:19 UTC (Fri) by joib (subscriber, #8541) [Link] (2 responses)

Carefully ordering writes was the idea behind soft updates (https://www.usenix.org/legacy/publications/library/procee...) previously(?) used in the BSD FFS. I believe it was replaced by a more traditional journaling approach because modern hardware is dependent on queuing for good performance, and all current hardware command queuing mechanisms are unordered. On such hardware soft updates wasn't able to keep up with journaling, which is better able to batch updates.

What about other filesystems?

Posted Jan 16, 2021 0:13 UTC (Sat) by wahern (subscriber, #37304) [Link] (1 responses)

You may be thinking of NetBSD, which replaced softupdates with journaling in FFS. Softupdates is still the only option on OpenBSD, but I'm more than contented on that front. AFAIU, FreeBSD also still supports softupdates, though the FreeBSD community seems to be migrating to ZFS.

What about other filesystems?

Posted Feb 1, 2021 14:25 UTC (Mon) by drjohnnyfever (guest, #144560) [Link]

The situation on FreeBSD is actually a bit complicated. The current default configuration uses journaled soft updates (su+j) which work like traditional soft updates with the exception that there is enough metadata journaled to avoid having to run a background fsck to reclaim space from orphaned allocations.

FreeBSD also supports proper journaling (logging) with gjournal which works at the block layer rather than in UFS itself. If I recall correctly, gjournal keeps a proper intent log of disk writes rather than just a metadata journal. It also allows you to use a separate device for the log.

ZFS does seem to be the way forward on FreeBSD but Netflix is pretty big consumer of UFS/FFS so they have been sponsoring continued development.

What about other filesystems?

Posted Jan 15, 2021 23:18 UTC (Fri) by zlynx (guest, #2285) [Link] (11 responses)

The way I remember it is that code which was fast on ext3 would lose all of its data on ext4. Or xfs. Or any filesystem other than ext3.

It was fast because it took advantage of ext3 data=ordered mode which forced all of the data writes to commit before doing things like a file rename. But that code was _never correct_. It was simply lucky.

From that whole disaster I think we all learned that making your filesystem provide ponies is dumb and stupid. Unless you brought ponies for every filesystem all you have accomplished is to create application developers who write applications that WILL lose data on any filesystem without ponies.

POSIX applications that need data safely on disk need to use fsync() or they are not POSIX and they are broken.

What about other filesystems?

Posted Jan 16, 2021 7:27 UTC (Sat) by dvdeug (guest, #10998) [Link] (10 responses)

fsync() is not ISO C; if I'm reading and understanding the man pages right, it wasn't in POSIX pre-2001 nor in versions of Unix prior to BSD 4.3. So standard C and old-school Unix had no support for applications that needed data safely on disk?

Linux filesystems still provide all sorts of ponies. POSIX filenames are way more limited than XFS, ext3 or ext4 provide, and it does bite people when interfacing with other OSes, including other Unixes. It's frustrating to call data safety a pony instead of admitting that they were trading data safety for speed, and that other people could validly disagree with your tradeoffs.

What about other filesystems?

Posted Jan 16, 2021 17:08 UTC (Sat) by Wol (subscriber, #4433) [Link]

> fsync() is not ISO C; if I'm reading and understanding the man pages right, it wasn't in POSIX pre-2001 nor in versions of Unix prior to BSD 4.3. So standard C and old-school Unix had no support for applications that needed data safely on disk?

I think old-school Unix ran on disks that wrote data in the order it was passed down to it. And old-school Unix just wrote stuff to disk without buffering and re-ordering and all that stuff.

And what application developers need is the ability to tell new-school Unix that SOMETIMES, the old way is best!

Cheers,
Wol

What about other filesystems?

Posted Jan 16, 2021 19:15 UTC (Sat) by NYKevin (subscriber, #129325) [Link] (8 responses)

Under old-school Unix, if power dropped at the wrong moment, you could lose the whole filesystem. In practice, you would often be able to recover some or most of it with fsck (or an fsck-like process), but there was never any guarantee that a system crash was a recoverable event. As a result, it has always been the case that "If you don't call fsync, then you might lose data." It's just that this used to be vacuously true (you couldn't call fsync, because it didn't exist).

On the other hand, an orderly shutdown has never lost data on any (reasonable, properly-engineered) Unix (that I'm aware of), whether you fsync or not. This is still true today.

What about other filesystems?

Posted Jan 17, 2021 8:16 UTC (Sun) by dvdeug (guest, #10998) [Link] (7 responses)

Linux can't save you if the computer fails due to any number of physical problems. It has always been the case that "you might lose data". The change is that older Unixes don't require you to do anything special to achieve maximal data safety offered, whereas modern systems require you to do something special for the OS to try its best. Arguably (as Wol does), going from ordered behavior to complex reordering is a downgrade in the promised level of support.

Taking code that didn't require fsync (because it doesn't exist) and, in the words of zlynx, saying that "it's broken" makes all ISO C code that needs data safety broken, which seems extreme. From my perspective, filesystem developers got the ability to increase safety by default or speed by default, and chose speed. That doesn't really upset me, so much as the fact that the word "pony" gets pulled out and one side gets painted as unreasonable, instead of it getting painted as a tradeoff and argued on that basis.

What about other filesystems?

Posted Jan 17, 2021 9:33 UTC (Sun) by NYKevin (subscriber, #129325) [Link] (2 responses)

1. Modern filesystems are much safer than old filesystems, by default. When was the last time you had to run fsck on boot?
2. I did not claim that old code was "broken," merely that it was at risk of losing data. My point is that both the application developer and the sysadmin would have been aware of that problem, and would take appropriate steps to remediate it (such as making regular backups, building a RAID, or whatever else makes sense). Everyone should still be taking those steps today, because as you say, nothing is 100% reliable.
3. Safety and speed are a tradeoff. But since we can't get to 100% safety, the primary value of safety is extrinsic: a safer system causes us to spend less time and resources on recovery (e.g. sitting around waiting for fsck to complete so I can boot my machine). So safety is itself a form of speed, and we can directly compare the time spent on recovery to the time spent on disk I/O - and as it turns out, once you make fsck obsolete, the disk I/O is a lot bigger for most people under most circumstances.

What about other filesystems?

Posted Jan 17, 2021 17:25 UTC (Sun) by Wol (subscriber, #4433) [Link] (1 responses)

> 2. I did not claim that old code was "broken," merely that it was at risk of losing data. My point is that both the application developer and the sysadmin would have been aware of that problem, and would take appropriate steps to remediate it (such as making regular backups, building a RAID, or whatever else makes sense). Everyone should still be taking those steps today, because as you say, nothing is 100% reliable.

RAID is useless if it can't guarantee that stuff has been safely saved to disk ... which it can't if the linux layers provide no guarantees ...

Backups are pretty much useless BY DEFINITION, because if the data is corrupted while saving to disk (which is what we're discussing here), then it's not been around long enough to be saved to backup.

Come on, all I'm asking for is the ABILITY TO REASON about what is happening, so I can provide my own guarantees. "The system may or may not have saved this data in the event of a crash" is merely the filesystem guys saying "not our problem", and the references to the SQLite guys jumping through hoops to make certain is the perfect example of them having to do somebody else's job, because surely it's the filesystem's guys' job to make sure that data entrusted to the filesystem is actually safely saved by the filesystem.

If I can have some guarantee that "this data is saved before that data starts to be written", then at least I can reason about it.

And yes, I know making all filesystems provide these sort of guarantees may be fun - I'm on the raid mailing list - I know - because I read all the messages and glance at all the patches and all that stuff (and don't understand much of it :-) - but when (I know, I know) I find the time to start really digging in to it, I want the raid layer to provide exactly those guarantees.

And why can't we say "these are the guarantees we *intend* to provide", and make it a requirement that anything new *does* provide them! If I provide a "flush" in the raid code, I can then pass it on to the next layer down, and then when it says it's done it I can then pass success back up (or E_NOTSUPPORTED if I can't pass it down). But this is exactly another of those *new* things they're trying to get into the linux block layer, isn't it - the ability to pass error codes back to the caller other than the most basic of "succeeded" or "failed", isn't it? If they can get that in, surely they can get my "flush" in, can't they?

Cheers,
Wol

What about other filesystems?

Posted Jan 18, 2021 17:47 UTC (Mon) by hkario (subscriber, #94864) [Link]

precisely, the issue is not that hardware can fail and that the file system can't promise anything in such case, the problem is that there is no specification common _to all file systems_ that says what is expected to happen under such and such scenarios

or to put it other way round: every file system will exhibit different behaviour on power failure and every file system requires slightly different handling to get something you can reasonably expect (like, when the file system says it committed data to disk, the data is committed to disk)

that's no way to program stuff when dealing with such fundamental thing in computing as data storage

What about other filesystems?

Posted Jan 17, 2021 17:28 UTC (Sun) by Wol (subscriber, #4433) [Link] (3 responses)

> Taking code that didn't require fsync (because it didn't (sic) exist) and, in the words of zlynx, saying that "it's broken" makes all ISO C code that needs data safety broken, which seems extreme.

Actually, I think that's called a regression, is it not? And one of Linus' absolute rules is "no regressions", isn't it?

Cheers,
Wol

What about other filesystems?

Posted Jan 17, 2021 20:31 UTC (Sun) by matthias (subscriber, #94967) [Link] (2 responses)

>> Taking code that didn't require fsync (because it didn't (sic) exist) and, in the words of zlynx, saying that "it's broken" makes all ISO C code that needs data safety broken, which seems extreme.
>Actually, I think that's called a regression, is it not? And one of Linus' absolute rules is "no regressions", isn't it?
There is no regression. The code works as well as it did back in the day. Back then it was clear that the data is only safe if the system is working properly, including no power outages. If you make sure that your system never crashes, the old code will work fine. If the system crashes, the old code might lose data, but this was always the case with this code. If you want additional guarantees (like no data loss in case of power loss), you have to use fsync.

Best,
Matthias

What about other filesystems?

Posted Jan 17, 2021 21:23 UTC (Sun) by Wol (subscriber, #4433) [Link] (1 responses)

mmmm

The risk of a corrupted filesystem hasn't changed.

But if the application writes a journal before doing an update, then provided there's no collateral damage it can recover from a crash mid transaction on an old unix system.

On a new system, it can't be sure whether the transaction log is okay and the update is damaged, or the transaction log is damaged and the transaction is lost, or even worse the transaction log is damaged and the transaction is partially complete!

Cheers,
Wol

What about other filesystems?

Posted Jan 18, 2021 10:29 UTC (Mon) by farnz (subscriber, #17727) [Link]

No, because even on ancient systems, you had elevator reordering for performance, and no guarantees about metadata writes; in the event of a crash, you simply did not know the state of the update or the transaction log, as even if you wrote them in a careful order, the elevator could reorder writes to disk, and the metadata writes might be reordered, too.

In other words, as soon as there's a kernel panic or a power failure, all bets are off on an old UNIX system. This wasn't an issue with reliable systems, but as reliability went down (no dedicated power supplies, no UPSes etc), it became an issue again.

What about other filesystems?

Posted Jan 15, 2021 23:46 UTC (Fri) by MrWim (subscriber, #47432) [Link] (1 responses)

I think the lesson is use sqlite, unless you've got a very good reason not to. They do care about persistence, compatibility and performance, and they've done (and will continue to do) all the hard work to make it work correctly, easily and predictably.

If you try and build on plain POSIX APIs you run the risk of filesystem developers changing the behaviour that you built and tested against out from under you, complete with a bunch of grey beards on the internet telling you that you've been doing it wrong all along..

What about other filesystems?

Posted Jan 16, 2021 17:32 UTC (Sat) by Wol (subscriber, #4433) [Link]

> I think the lesson is use sqlite, unless you've got a very good reason not to.

Like wanting to edit existing files that are in other formats? Like wanting a simple human-readable file that works with "cat" or "less" etc? Like me wanting to write a database back end!!!?

There are myriad reasons for not wanting to off-load to sqlite. There are myriad reasons for not wanting to offload full stop. All I want is the ability to reason about what my program is doing, and that includes what it is doing when writing to disk.

Oh - and I know bufferbloat kills networking speed, but haven't they started finding the same problem in other places - INCLUDING the disk subsystem? Might it not be a performance improvement to have a drain() syscall that empties the buffers before accepting new input? And that would also provide exactly the semantics we need to reason about safely writing to disk?

Cheers,
Wol

What about other filesystems?

Posted Jan 16, 2021 8:14 UTC (Sat) by khim (subscriber, #9252) [Link] (14 responses)

Situation: we have coarse-grained API which describes what can be done when and in what order.

Problem: even this coarse-grained API (with the exact semantics as described in the documentation) is hard enough to implement in practice, thus filesystem developers take shortcuts — and then apps start relying on these shortcuts.

Solution: let's invent even more fine-grained, even harder to implement API — and then apps would be happy.

Is that about right?

I don't see how this can ever work. Applications have never used existing APIs correctly — because implementations allowed this. What makes you sure that they would use your new, “better”, APIs correctly?

It would be the-same-as-before-only-worse.

Now, if we can implement in ext4 (ext5, xfs7, doesn't matter… any filesystem which is sufficiently popular for the developers to care) fine-grained enough handling of these hints supplied by your “better APIs” (fine-grained enough that incorrect use would actually be visible to developers), then, and only then, would pushing new APIs make sense.

We are not yet even at the stage where incorrect use of fsync and co. is visible to developers, thus it's definitely too early to talk about a more fine-grained API.

Or maybe, if with “fast transactions” we can ensure that misuse of that API would be visible to developers, then it's time to introduce it and ask developers to test apps with it.

But please don't introduce an API whose misuse developers couldn't easily detect. The fsync story is problematic enough as it is. Don't make it worse.

What about other filesystems?

Posted Jan 16, 2021 9:52 UTC (Sat) by matthias (subscriber, #94967) [Link] (13 responses)

> Situation: we have coarse-grained API which describes what can be done when and in what order.

I would not call the POSIX API coarse-grained. It is a very low-level API that requires many syscalls and taking care of all the ordering issues with fsync.

> Problem: even this coarse-grained API (with the exact semantic as described in the documentation) is hard enough to implement in practice thus filesystem developers make shortcuts — and then apps start relying on these shortcuts.

As far as I know, filesystem developers have managed quite well at implementing the exact semantics. However, the semantics leave much implementation-specific behavior. Apps should not rely on this, but the problem is that they did and do.

> Solution: let's invent even more fine-grained, even harder to implement API — and then apps would be happy.

The new API should not be more fine-grained. It should be more high-level. Transactional semantics are much easier to deal with on the app level. You basically need two new syscalls: start_transaction() and commit_transaction(). The semantics would be far more intuitive than what we have now, where you need to remember not only to fsync your file, but also the directory you created your file in and to do it in the correct order.

Of course this is harder to implement at the fs level, but it is easier to get right on the app level. And while there are thousands of apps, there are only a few filesystems. I see a clear benefit of moving complexity from the app level to the fs level. And you cannot get easy-to-implement APIs for both levels. Somewhere you need to hide the underlying complexity of the hardware (parallelism, NCQ, etc.).

> Is that about right?
>
> I don't see how this can ever work. Applications have never used existing APIs correctly — because implementations allowed this. What makes you sure that would use your new, “better”, APIs correctly?

Because high-level APIs should be easier and more intuitive to use.

> It would be the-same-as-before-only-worse.
>
> Now, if we can implement in ext4 (ext5, xfs7, doesn't matter… any filesystem which is sufficiently popular for the developers to care) fine-grained enough handling of these hints supplied by your “better APIs” completely (as in: incorrect use would, actually, be visible to developers) then and only then pushing new APIs would make sense.

This is not possible. But an API with less implementation specific behavior than POSIX should already help in getting the apps right.

> We are not yet even at stage where incorrect use fsync and co is visible to developers

Of course, this is not visible. After all, in almost all cases, everything goes well. The order in which data hits the disk only matters if a crash occurs in the very right (or wrong) moment.

> thus it's definitely too early to talk about more fine-grained API.

The API should not be more fine-grained. It should be easier to use. Even changing a simple config file in a consistent way is horrible in POSIX. With transactions this would look like: start_transaction() open() write() close() commit_transaction(). This looks much more intuitive to me than the POSIX way, where you need to take care of creating a new file, renaming it afterwards, and issuing the right fsync calls at the right times.

And now try to imagine dependent changes in two different files. With POSIX this requires you to implement some journaling on the app level, as there is no way of renaming two files atomically.

> Or maybe, if with “fast transactions” we can ensure that misuse if that API would be visible to developers then it's time to introduce it and ask developers to test apps with it.

This is the wrong way to go. Detecting misuse will only be possible if the system crashes at the right moment. And of course "fast transactions" also have implementation-specific behavior. This is inherent to the current API. If you want to test apps for compliance with this API, you will need some kind of fuzzer that randomly reorders everything to the extent allowed by POSIX and then crashes at some random point in time.

> But please don't introduce API misuse of which developers couldn't easily detect. Fsync story is problematic enough as it is. Don't make it worse.

Ordering issues in parallelism are always hard to detect. Not only for fs writes but also for multiple threads. If you want to avoid that entirely, then you need to use an implementation that is strictly ordered (and by no means parallel). I doubt that many people are willing to pay that price.

The current API is horrible for app developers. Let's create an API that is easier to get right. And yes this means it has to be harder for the fs developers.

Best,
Matthias

What about other filesystems?

Posted Jan 16, 2021 16:38 UTC (Sat) by Wol (subscriber, #4433) [Link] (9 responses)

> The current API is horrible for app developers. Let's create an API that is easier to get right. And yes this means it has to be harder for the fs developers.

AND YOU CAN'T EVEN RELY ON JOURNALLING because you don't know whether the file system has written the journal before, after, or in the middle of writing the data.

Really, all I want is something like fsbarrier(), which GUARANTEES that stuff written after it is written after stuff that was written before it. I don't give a monkeys whether the filesystem batches, parallelises, or what ever other O_PONIES writes, provided I can reason that this call makes sure my stuff hits the disk in the order I expect.

If I want to trash my application's performance with excessive use of fsbarrier(), that's my problem. If the OS expects me to trash EVERYONE ELSE'S performance with excessive use of fsync() or fsfsync(), then that's a BIG problem for the OS!

Oh - and wasn't advice about how to shut a system down always "# sync; sync; sync; halt"? So all of us old hands expect sync() to do a filesystem flush? And do you really expect me as a developer to do that after most writes when I expect something like that to bring the system to its knees?

Cheers,
Wol

What about other filesystems?

Posted Jan 16, 2021 21:10 UTC (Sat) by matthias (subscriber, #94967) [Link] (5 responses)

>AND YOU CAN'T EVEN RELY ON JOURNALLING because you don't know whether the file system has written the journal before, after, or in the middle of writing the data.

Journalling was primarily invented to ensure the integrity of the filesystem. I.e., to avoid a total loss of the filesystem in case of power loss/crash.

> Really, all I want is something like fsbarrier(), which GUARANTEES that stuff written after it is written after stuff that was written before it.

This would be quite nice. fsync() only guarantees ordering for data written to the given file descriptor. fsbarrier() would probably be easier to use for the app developer. No need to call it for every involved file descriptor. And yes, in many cases guaranteeing ordering would be enough. No need to actually force the data to the disk before the syscall can return.

> I don't give a monkeys whether the filesystem batches, parallelises, or what ever other O_PONIES writes, provided I can reason that this call makes sure my stuff hits the disk in the order I expect.
> If I want to trash my application's performance with excessive use of fsbarrier(), that's my problem. If the OS expects me to trash EVERYONE ELSE'S performance with excessive use of fsync() or fsfsync(), then that's a BIG problem for the OS!

Why should fsbarrier() be any different in this regard than fsync()? Neither of the two requires the system to cripple performance, and both of them can be implemented by just forcing a global filesystem sync. The performance of fsync() is getting much better, as the developers actually use the freedom they have. But I am wondering why you expect filesystem developers to implement the (from a filesystem perspective) much harder fsbarrier() call more efficiently than the relatively straightforward fsync() call. fsbarrier() would probably require a major rewrite of the VFS layer just to be able to compute the list of files that are affected by such a call. Chances are good that developers would use similar shortcuts as they have done for fsync() for decades, and the performance of the whole system would be crippled by such a call.

> Oh - and wasn't advice about how to shut a system down always "# sync; sync; sync; halt"? So all of us old hands expect sync() to do a filesystem flush? And do you really expect me as a developer to do that after most writes when I expect something like that to bring the system to its knees?

sync guarantees a full filesystem flush. No changes there. That is indeed a bit of overkill if you just require ordering. fsync used to be quite inefficient as well, but it is getting better in this regard. And I know nobody who suggests to use sync in normal apps. fsync should be enough if used correctly.

Best,
Matthias

What about other filesystems?

Posted Jan 16, 2021 21:42 UTC (Sat) by Wol (subscriber, #4433) [Link] (4 responses)

> Why should fsbarrier() be any different in this regard than fsync(). Neither of the two requires the system to cripple performance. And both of them can be implemented by just forcing a global filesystem sync. The performance of fsync is getting much better, as the developers actually use the freedom they have. But I am wondering why you expect filesystem developers to implement the (from a filesystem perspective) much harder fsbarrier() call more efficiently than the relatively straightforward fsync() call. fsbarrier() would probably require a major rewrite of the VFS layer to even be able to compute the list of files that are effected by such a call. Chances are good that developers will use similar shortcuts as they have done for fsync() for decades and performance of the whole system will cripple with such a call.

So let's say I want to guarantee - let's say ten or twenty - files have all flushed before I start writing the next file, can I do those fsync()s in parallel? Without having to spawn 20 threads and then wait on them all? Whatever, that's a lot of work.

And with an fsfsync, again does that provide the ordering guarantee? I've heard that yes it guarantees everything that's been written gets flushed, but does it put a hard barrier in (like my fsbarrier()), or does it just stall all new writes until all the old writes have been flushed, or does it just guarantee that everything written before the fsfsync is flushed but it doesn't stop newer writes being merged forwards and being caught up in the flush?

Because if fsfsync() puts that barrier in, I'm simply changing a synchronous fsfsync() to an asynchronous fsbarrier(), if it's the second option it's causing a performance impact on the system, and if it's the third option then my app has to do a synchronous call with the performance impact that implies.

Cheers,
Wol

What about other filesystems?

Posted Jan 17, 2021 4:41 UTC (Sun) by NYKevin (subscriber, #129325) [Link] (3 responses)

It depends. Let's go through these:

- For fsync()'ing multiple files, the standard answer is "use a thread pool." This is also the standard answer to "I want asynchronous I/O like on Windows," so no surprise there.
- As the article mentions, they are discussing an "fsync multiple files" syscall, which will (probably) further alleviate this problem (if it actually happens).
- I'm not aware of any syscall called "fsfsync()," so I assume you meant syncfs(2). That function is not in POSIX, so all we have to go on is the note in that man page, which explicitly states that "sync() or syncfs() provide the same guarantees as fsync() called on every file in the system or filesystem respectively."
- POSIX says that sync(2) is not required to wait for the writes to complete before returning (unlike fsync()). As noted above, POSIX does not specify syncfs() at all.
- Arguably, a conforming implementation could implement sync() as a no-op, because POSIX says it causes outstanding data "to be scheduled for writing out" - but it was *already* scheduled for writing out.
- Therefore, if you want to be pedantically POSIX-correct, you should not use sync(2) at all, because it gets you exactly nothing according to the standard.
- Since syncfs() is already Linux-specific, you can rely on its Linux-specific guarantees, if you are in a position to call it in the first place.

What about other filesystems?

Posted Jan 18, 2021 5:34 UTC (Mon) by joib (subscriber, #8541) [Link] (2 responses)

I believe io_uring supports fsync, so that would be a way to do an asynchronous fsync on somewhat modern Linux.

What about other filesystems?

Posted Jan 18, 2021 7:16 UTC (Mon) by NYKevin (subscriber, #129325) [Link]

Well, you can also use aio_fsync(3), but that's basically just a crappier version of "use a thread pool."

IMHO this is a broader issue with aio(7) and not a problem with fsync in particular.

What about other filesystems?

Posted Jan 18, 2021 10:45 UTC (Mon) by farnz (subscriber, #17727) [Link]

io_uring does provide the primitives needed; there's IORING_OP_FSYNC (with IORING_FSYNC_DATASYNC to weaken from fsync to fdatasync) and IORING_OP_SYNC_FILE_RANGE for flushing caches asynchronously, and the IOSQE_IO_DRAIN and IOSQE_IO_LINK flags to order io_uring operations with respect to each other so that you can issue the fsync after all the related writes have been done.
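
For illustration, here is a minimal sketch using liburing (assuming liburing is available; fd, buf, and len are set up elsewhere) that links a write to a following fdatasync-strength fsync, so the flush is only started once the write has completed:

    #include <liburing.h>

    /* Queue a write followed by a linked fsync; the IOSQE_IO_LINK flag makes
       the fsync wait for the write to complete first. Returns 0 on success. */
    int write_then_fsync(int fd, const void *buf, unsigned len)
    {
        struct io_uring ring;
        if (io_uring_queue_init(8, &ring, 0) < 0)
            return -1;

        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        io_uring_prep_write(sqe, fd, buf, len, 0);
        sqe->flags |= IOSQE_IO_LINK;    /* next SQE runs only after this one */

        sqe = io_uring_get_sqe(&ring);
        io_uring_prep_fsync(sqe, fd, IORING_FSYNC_DATASYNC);

        io_uring_submit(&ring);

        int ret = 0;
        for (int i = 0; i < 2; i++) {   /* reap both completions */
            struct io_uring_cqe *cqe;
            if (io_uring_wait_cqe(&ring, &cqe) < 0) {
                ret = -1;
                break;
            }
            if (cqe->res < 0)
                ret = -1;
            io_uring_cqe_seen(&ring, cqe);
        }
        io_uring_queue_exit(&ring);
        return ret;
    }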

What about other filesystems?

Posted Jan 18, 2021 13:05 UTC (Mon) by pbonzini (subscriber, #60935) [Link] (1 responses)

> AND YOU CAN'T EVEN RELY ON JOURNALLING because you don't know whether the file system has written the journal before, after, or in the middle of writing the data.

The filesystem is going to write data before metadata, so that you won't have a file that's full of zeros (or worse, full of stale data including another user's cleartext password). With "old Unix" you could get a file that's full of trash after a power failure; I sure did. So if anything journalling makes things better.

What about other filesystems?

Posted Jan 21, 2021 19:42 UTC (Thu) by mstone_ (subscriber, #66309) [Link]

Yeah, the confusion here comes from comparing the current state of affairs to a past state that didn't exist.

What about other filesystems?

Posted Jan 20, 2021 17:44 UTC (Wed) by anton (subscriber, #25547) [Link]

> Oh - and wasn't advice about how to shut a system down always "# sync; sync; sync; halt"?
No, the advice was to type
sync
sync
sync
halt
That's because sync did not block (unlike on Linux), so the time it took to type the additional syncs and the halt gave the original sync time to finish.

What about other filesystems?

Posted Jan 16, 2021 20:49 UTC (Sat) by NYKevin (subscriber, #129325) [Link] (2 responses)

The fundamental problem with this argument is that the API you describe can be (and has been) implemented in userspace (in the form of SQLite, as well as numerous "real" databases). Therefore, if you want to argue in favor of doing this in kernel space, it is not enough to argue that a new API would be "better" in various ways. You need to *specifically* address one question: Why should anyone re-implement already working userspace code in the kernel? Would it provide some performance advantage? Would it somehow enable you to do things that you can't currently do? Or would it just be "more convenient?" If the latter, how is that the kernel's problem?

What about other filesystems?

Posted Jan 17, 2021 17:40 UTC (Sun) by Wol (subscriber, #4433) [Link] (1 responses)

What I find hard to understand is, if the database (SQLite, whatever) is using linux syscalls, how does it know the data has actually been written? Or does it do loads of sync()s, and then pause all writes for ten seconds or so waiting for the data to flush, etc etc.

I can see how databases can provide 99.999% reliability. I'm active on the raid list. I know all about disk timeouts, disks lying, how long things take to get flushed, etc etc. I simply do not see how an application can guarantee safety.

As for "why should it be in the kernel" - because LOTS of developers will benefit from the ability to reason about the state of a system in a crash scenario. Why should all the database developers be forced to duplicate each others' work?

And frankly, if I commit something to the filesystem for saving, surely I should be able to ask the filesystem "have you saved it?" AND BE ABLE TO RELY ON THE ANSWER! (Yep, I know disks lie, and I don't expect the file system necessarily to deal with that, but it really should be held responsible for its own actions!)

Cheers,
Wol

What about other filesystems?

Posted Jan 17, 2021 22:13 UTC (Sun) by NYKevin (subscriber, #129325) [Link]

The process that SQLite uses is documented in https://sqlite.org/atomiccommit.html in a very high level of detail.

TL;DR: They make a copy ("rollback journal") of the data they are about to overwrite, fsync that copy, overwrite the data, fsync the database itself, and finally delete the rollback journal.
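
In rough C, the ordering that pattern relies on looks something like the sketch below; the helper functions and journal file name are made up, and this only illustrates the sequence described on that page, not SQLite's actual code:

    #include <unistd.h>

    /* Hypothetical helpers standing in for the real page I/O. */
    static void copy_pages_about_to_change(int db_fd, int journal_fd)
    { (void)db_fd; (void)journal_fd; /* elided */ }
    static void write_new_pages(int db_fd) { (void)db_fd; /* elided */ }

    static void atomic_commit_sketch(int db_fd, int journal_fd)
    {
        /* 1. Copy the original content of the pages about to be overwritten
              into the rollback journal, and make that copy durable. */
        copy_pages_about_to_change(db_fd, journal_fd);
        fsync(journal_fd);

        /* 2. Overwrite the database pages in place, then make the new
              content durable. */
        write_new_pages(db_fd);
        fsync(db_fd);

        /* 3. Only now is it safe to delete the rollback journal; a crash
              before this point is recovered by replaying the journal. */
        unlink("example.db-journal");   /* illustrative file name */
    }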

What about other filesystems?

Posted Jan 16, 2021 8:51 UTC (Sat) by matthias (subscriber, #94967) [Link] (9 responses)

> All this talk about "apps should use fsync" or "apps should use fsfsync" fills me with horror, as someone who wants to write a system that relies on data integrity. Are you telling me that my app needs to be filesystem-aware, and not only that but aware of what mount options were used, so I know which commands to call to make sure that my data is safe?

The app does not need to be filesystem-aware. If it closely follows the POSIX standard, the data should be safe on any filesystem, of course excluding ancient filesystems like FAT. Unfortunately, the POSIX standard is not very intuitive when it comes to data ordering and lacks some features. If your app chooses to ignore the POSIX standard, then you depend on the filesystem implementation and you run the risk that things break when this implementation changes.

> And no, I don't want to offload that onto the glibc guys either.
> I know it'll be hard, but is there any way we can get the VFS to dictate that certain things are supposed to happen in a certain order.

What I would like to have are transactional semantics with ACID properties. Unfortunately, I do not see the VFS interface to be extended by transactions in any foreseeable future. The ways in which config files are updated always was a dirty workaround. Even if you strongly order: create new file, write new file, rename new file to old file, there is still the possibility of ending up with old file and new file existing in parallel after a crash. If this would be an atomic transaction, this could not happen. And it should even be more efficient to implement on the fs level, as there would be no need to fsync after every step.

> As an app, I don't care *how* the OS does it, but I want to be able to reason about what's hit the disk and when.

According to POSIX, some data has hit the disk after fsync() returned.

> And I DON'T want to have to worry about what the filesystem is or how it does it.

Of course. But this is only possible if you closely follow POSIX rules. If you do not follow these rules closely, you will always depend on implementation details.

> I've said it before, but I don't care too much about fsync or fsfsync, and I really don't want to have to worry about how the OS will react to those calls -

Then you should not program at the POSIX level. Data safety according to POSIX always relied on using the ordering primitives like fsync. With some filesystems you can be lucky and it works without fsync, but there never was a guarantee. You can still use some safe abstraction that takes care of your needs. SQLite was already suggested. It will happily do all the needed fsync calls for you and your data should be safe.

> one only has to go back to the ext3/ext4 transition/debacle where code that was fast on ext3 brought ext4 to its knees... if I can know for certain that stuff hits the disk in the order I write it, and tell the system what it can write in parallel and what it can't ... (basically data can be written in parallel, logs can be written in parallel, just not in parallel with each other. And the OS needs me to tell it the difference between the two, it won't know by itself.)

This can only be done by an API extension on the VFS level. As I already said, I would like to have transactional semantics for filesystems, but it is quite unlikely that we will get these.

Best,
Matthias

What about other filesystems?

Posted Jan 16, 2021 18:13 UTC (Sat) by Wol (subscriber, #4433) [Link] (8 responses)

> The app does not need to be filesystem-aware. If it closely follows the POSIX standard, the data should be safe on any filesystem, of course excluding ancient filesystems like FAT. Unfortunately, the POSIX standards is not very intuitive when it comes to data ordering and lacks some features. If your app chooses to ignore the POSIX standard, then you depend on the filesystem implementation and you are at the risk that things break when this implementation changes.

Are you sure!? As I understood POSIX, it dictates what an app sees on a properly functioning system. If my app follows POSIX, it will work fine so long as the underlying system is working fine. As far as POSIX was concerned (dunno if it's changed), if the system crashed then that was "undefined behaviour", with all the same downsides as the C definition of undefined behaviours.

So when we're talking about system crashes, POSIX is the wrong spec ...

Cheers,
Wol

What about other filesystems?

Posted Jan 16, 2021 18:37 UTC (Sat) by matthias (subscriber, #94967) [Link] (7 responses)

> Are you sure!? As I understood POSIX, it dictates what an app sees on a properly functioning system. If my app follows POSIX, it will work fine so long as the underlying system is working fine. As far as POSIX was concerned (dunno if it's changed), if the system crashed then that was "undefined behaviour", with all the same downsides as the C definition of undefined behaviours.

Actually, this is very much what could happen, and still can happen, with non-journaling filesystems. Pull the power cord at the wrong moment and you have a broken filesystem that cannot be mounted. With a bit of luck you can get your files back with fsck. Fortunately, the situation has improved: on a journaling filesystem you should, at worst, only lose data that was being modified.

fsync() was added for precisely one reason: to have less undefined behavior in the case of a crash. As long as the system is properly functioning, there is no need for fsync() at all. Unfortunately, not all systems are properly functioning. Furthermore, not all people are happy with the semantics of the good old days that said: undefined behavior, all your data might be lost after a power outage.

> So when we're talking about system crashes, POSIX is the wrong spec ...

Nowadays, POSIX can give some guarantees on what can happen in certain situations. If you do not use fsync(), your writes might hit the disk in some random order, and what you see after a crash is any intermediate state of that random order. If you use fsync(), you are sure that the fsynced data actually hit the disk, and you have some control over the order. And if I got this correctly, the whole discussion is about the order in which data hits the disk and what can happen in case of a system crash. None of what we are discussing has any relevance on a properly working system, as after umount all data has hit the disk, and on a properly functioning system all disks are unmounted before the power is turned off.

Of course, even today's newer specs (which include fsync) do not help if the reason for the system crash was the hard drive catching fire or the storage driver going mad. But they should help a lot for more common causes of crashes like a power outage or a kernel panic.

Cheers,
Matthias

What about other filesystems?

Posted Jan 17, 2021 17:58 UTC (Sun) by Wol (subscriber, #4433) [Link] (6 responses)

> Actually, this is very much what could happen, and still can happen, with non-journaling filesystems. Pull the power cord at the wrong moment and you have a broken filesystem that cannot be mounted. With a bit of luck you can get your files back with fsck. Fortunately, the situation has improved: on a journaling filesystem you should, at worst, only lose data that was being modified.

And this is pretty much the perfect example of what is wrong with the current setup. The filesystem journal is there to protect the filesystem, to hell with the user's data. So HOW as a database developer am I supposed to protect my database (other than writing to a raw partition!) if I can't trust the filesystem to protect my user-space journal!

That's why ext3 data=ordered was so good - it gave the APPLICATION DEVELOPERS the guarantee that, after a crash, writes *appeared* to have been written to disk in the order that they were made. That's ALL a developer needs!

(Oh, and I don't think my work on raid would help here: even if raid could provide those guarantees, which I hope it can, an application could not rely on them unless the filesystem ALSO provided those guarantees.)

Cheers,
Wol

What about other filesystems?

Posted Jan 17, 2021 21:03 UTC (Sun) by matthias (subscriber, #94967) [Link] (5 responses)

>> Actually, this is very much what could happen, and still can happen, with non-journaling filesystems. Pull the power cord at the wrong moment and you have a broken filesystem that cannot be mounted. With a bit of luck you can get your files back with fsck. Fortunately, the situation has improved: on a journaling filesystem you should, at worst, only lose data that was being modified.

>And this is pretty much the perfect example of what is wrong with the current setup. The filesystem journal is there to protect the filesystem, to hell with the user's data.

If the filesystem gets corrupted, then all user data is lost, so indeed the journal is there to protect the data.

>So HOW as a database developer am I supposed to protect my database (other than writing to a raw partition!) if I can't trust the filesystem to protect my user-space journal!

You can trust it. Do an fsync on the user-space journal. Anything else will not work and has never worked. OK, especially if you are developing databases, O_DIRECT might be a viable alternative. And writing to a raw partition also does not guarantee that the writes are in order. You still have to flush the caches to ensure that something has actually hit the disk, or use synchronous I/O from the beginning.

>That's why ext3 data=ordered was so good - it gave the APPLICATION DEVELOPERS the guarantee that, after a crash, writes *appeared* to have been written to disk in the order that they were made. That's ALL a developer needs!

This is clearly wrong. data=ordered only ensures that data is written before the related metadata. Writes to different files are not guaranteed to be ordered. Overwriting a file is not guaranteed to be ordered. The blocks can hit the disk in a random order. What data=ordered does guarantee is that, when appending to a file, the new data hits the disk before the new blocks are added to the inode, i.e., it is not possible for the file to contain garbage instead of the new data. And I think it is guaranteed that a rename is only done after the renamed file has hit the disk. This of course helps for those people that update data with the create-new-file-then-rename model. But there is not much that really helps for databases.

>(Oh, and I don't think my work on raid would help that, even if raid could provide those guarantees which I hope it can, that an application could rely on it unless the filesystem ALSO provided those guarantees.)

The work on raid is essential, as of course all layers below the filesystem have to provide certain data safety features. Especially the filesystem requires some kind of transactional semantics for its own journal.

Best,
Matthias

P.S.: Actually, database developers are probably among the first to scream in terror if a filesystem were to guarantee strong ordering properties, because of the immense performance penalties, which they do not want to pay. At least the big systems know pretty well which data has which ordering requirements, and which data should be in the memory cache or will likely not be used again soon, and they want to control all these aspects themselves. If they do not use a raw partition from the beginning, they will just use the filesystem to reserve a bunch of blocks and then use synchronous I/O.

What about other filesystems?

Posted Jan 17, 2021 21:18 UTC (Sun) by Wol (subscriber, #4433) [Link] (2 responses)

> Actually database developers are probably among the first to scream in terror if a filesystem will guarantee strong ordering properties

I did say the *appearance* of strong ordering guarantees :-)

> If the filesystem gets corrupted, then all user data is lost, so indeed the journal is there to protect the data.

But if the data in flight is corrupted, then the only way to get the system back to a sane (for the user) state may be "format, recover backup". Yes making sure the filesystem is itself consistent is important but it's only part of the picture. If I can't trust the state of the data, I need to recover from backup.

Cheers,
Wol

What about other filesystems?

Posted Jul 23, 2021 17:55 UTC (Fri) by andresfreund (subscriber, #69562) [Link] (1 responses)

You can trust the data if you tell the OS you want to pay the price (use fsync, or O_SYNC/O_DSYNC) and you only rely on data known to be flushed after a crash.

The alternative you're proposing basically implies that writes cannot be buffered for any meaningful amount of time. Oh, your browser updated its history database? Let's just make that wait for all the temporary files being written out, the file being copied concurrently, etc. And what's worse, do not allow any concurrent sync writes to finish (like file creation), because that would violate ordering.

Ext3's ordering guarantees were weaker and yet led to horrible stalls all the time. There were constant complaints about Firefox's SQLite databases stalling the system, etc.

The OS/FS aren't magic.

What about other filesystems?

Posted Jul 23, 2021 18:01 UTC (Fri) by andresfreund (subscriber, #69562) [Link]

> The OS/FS aren't magic.

In particular, they have no reliable way of knowing which files (or even parts of files) are related and need constrained ordering between writes, and which are unrelated and thus can be handled independently.

What about other filesystems?

Posted Jul 23, 2021 13:10 UTC (Fri) by Defenestrator (guest, #153400) [Link]

> And I think it is guaranteed that a rename only is done after the renamed file has hit the disk.

This is often true in practice (in particular, in ext3 and in ext4 outside of early versions), but not always explicitly guaranteed. See the auto_da_alloc option added to ext4 for more info and background.

What about other filesystems?

Posted Jul 23, 2021 17:43 UTC (Fri) by andresfreund (subscriber, #69562) [Link]

> Do an fsync on the user-space journal.

With a small bit of care, fdatasync() should be enough, and will often be a lot faster (no synchronous updating of unimportant filesystem metadata like mtime, which would turn a single write into multiple).
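A minimal sketch of what "a small bit of care" means here (the file name, sizes, and the preallocation step are assumptions for illustration, not anything a particular database actually does): preallocate the journal once, then append records with pwrite() plus fdatasync().

    /* Sketch of a preallocated user-space journal; names, sizes and the
     * record layout are invented for illustration. */
    #include <fcntl.h>
    #include <unistd.h>

    int journal_open(const char *path, off_t size)
    {
        int fd = open(path, O_RDWR | O_CREAT, 0644);
        if (fd < 0)
            return -1;
        /* Preallocating up front means later appends do not allocate new
         * blocks, which is what keeps fdatasync() cheap. */
        if (posix_fallocate(fd, 0, size) != 0) {
            close(fd);
            return -1;
        }
        return fd;
    }

    int journal_append(int jfd, off_t off, const void *rec, size_t len)
    {
        if (pwrite(jfd, rec, len, off) != (ssize_t)len)
            return -1;
        /* fdatasync() flushes the data and any metadata needed to read it
         * back, but not mtime-only updates that fsync() would also force out. */
        return fdatasync(jfd);
    }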

> Anything else will not work and has never worked. Ok, especially if you are developing databases, O_DIRECT might be a viable alternative.

O_DIRECT on its own does not remove the need for an fsync/fdatasync. Devices have volatile write caches, and they do lose data on power loss/reset. O_DIRECT alone only avoids issues with the OS write cache. And it makes the f[data]sync cheaper, because it will often only have to send a cache flush (and can transparently avoid even that on devices without volatile write caches).

Alternatively it can be combined with O_DSYNC to achieve actually durable writes - but if one isn't careful that can tank throughput. It either adds the FUA flag to each write or does separate sync commands after each write, which can end up being more cache flushes for a workload. It can be significantly faster to do a series of writes and then a single cache flush.

It's hardware-dependent too whether FUA or separate cache-flush commands are faster :(. Dear Samsung, please fix your drives.

My testing says that on most NVMe devices with a volatile cache, O_DSYNC wins if the queue depth is very low and latency is king (only one roundtrip needed). fdatasync() wins if there are more than a few writes happening at once, especially if all/most need to complete before the "user transaction" finishes - the lower number of flushes saves more than the additional roundtrip costs.
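A rough sketch of the batch-then-flush strategy described above (the path, record count, and 4096-byte alignment are assumptions for illustration): open with O_DIRECT, submit several aligned writes, then pay for one cache flush at the end. Opening with O_DIRECT | O_DSYNC instead would make each write durable on its own.

    /* O_DIRECT bypasses the page cache, but the device write cache is
     * still volatile, so durability still needs fdatasync() or O_DSYNC. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int write_batch_then_flush(const char *path, int nrecs)
    {
        int ret = -1;
        int fd = open(path, O_WRONLY | O_CREAT | O_DIRECT, 0644);
        if (fd < 0)
            return -1;

        void *buf;
        if (posix_memalign(&buf, 4096, 4096) != 0) {
            close(fd);
            return -1;
        }
        memset(buf, 0, 4096);    /* stand-in for real record data */

        for (int i = 0; i < nrecs; i++)
            if (pwrite(fd, buf, 4096, (off_t)i * 4096) != 4096)
                goto out;

        /* One cache flush for the whole batch.  With O_DSYNC every
         * pwrite() would be durable on its own (FUA or a flush per
         * write), which costs more when many writes are in flight. */
        ret = fdatasync(fd);
    out:
        free(buf);
        close(fd);
        return ret;
    }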

> Actually database developers are probably among the first to scream in terror if a filesystem will guarantee strong ordering properties because of the immense performance penalties, which they do not want to buy.

Indeed! And there's no realistic way the FS can do better.

What about other filesystems?

Posted Jan 17, 2021 0:53 UTC (Sun) by orib (subscriber, #62051) [Link]

> All this talk about "apps should use fsync" or "apps should use fsfsync" fills me with horror, as someone who wants to write a system that relies on data integrity. Are you telling me that my app needs to be filesystem-aware, and not only that but aware of what mount options were used, so I know which commands to call to make sure that my data is safe?

No, the opposite: your app needs to be aware of the *STANDARDS* that the filesystem is attempting to conform to, rather than what the implementation of the day happens to do.

What about other filesystems?

Posted Jan 18, 2021 17:05 UTC (Mon) by smoogen (subscriber, #97) [Link]

From reading your other posts on this, I think you also have to be aware of what hardware the system has in it if you want that level of guarantee. There are a large number of hardware components that, for the sake of speed, fool the OS into saying stuff was written when it wasn't. You can fsync(), sync(), and all the other things, and the hardware will still write stuff in the order it wants. And yes, like bufferbloat, it is built into all parts of modern hardware, from the cache on the hard drive, to the controller of the hard drive, to the PCI bus the controller is connected to, to the memory subsystem.

In the end, for a lot of hardware you just have to give up on guaranteeing a lot of things, because the entire industry has agreed to lie in order to give the feeling of speed. To get around this you have to move back to real-time hardware, which is much slower (fewer smart caches that could break RT guarantees) but more predictable. Not the answer you probably want to hear... but I think a good reason the kernel people don't see the lies as a problem any more is that finding hardware made after 2004 that doesn't lie in some way to its calling system is rare.

What about other filesystems?

Posted Jan 20, 2021 18:51 UTC (Wed) by anton (subscriber, #25547) [Link]

All this talk about "apps should use fsync" or "apps should use fsfsync" fills me with horror, as someone who wants to write a system that relies on data integrity. Are you telling me that my app needs to be filesystem-aware, and not only that but aware of what mount options were used, so I know which commands to call to make sure that my data is safe?
I think that Ted Ts'o's position is that you should use fsync a lot, independent of the filesystem; and that you should not just use it on the file you were working on, but possibly on other files; he explicitly mentions the directory that contains the file, but my expectation is that it might also affect other files, depending on the filesystem. E.g., some filesystems have a file that contains the inodes, so why not require that the application also fsyncs that?

My position is pretty much the opposite: a good filesystem should provide decent consistency guarantees in case of an OS crash or power outage (which, BTW, is not covered by POSIX, so any claim by Ted Ts'o that POSIX requires applications to use fsync the way he likes is nonsense). And these consistency guarantees should orient themselves on the guarantees that filesystems give in the non-crash case. That is the usual case, which is what programmers develop for and test against; so even if Ted Ts'o's filesystems guaranteed to corrupt all non-fsynced data in case of a crash, it is unlikely that applications would contain enough fsyncs that Ted Ts'o could not blame them for data loss.

Given that POSIX guarantees sequential consistency for filesystem operations in the non-crash case, my take is that filesystems should guarantee a sequentially consistent state in case of a crash; but, for performance reasons, not necessarily the state at the time of the crash. You can use sync whenever you want the crash state to be consistent with the logical state, e.g., before reporting the completion of a transaction over the network.

Unfortunately, there is only one filesystem in Linux that gives such a guarantee, last I looked (NILFS2), but that's because we let Linux get away with filesystems that don't give such guarantees; admittedly, Linux crashes so rarely these days, and at least around here power outages are so rare, that the itch is small.

Concerning whether apps should be filesystem-aware, my take is that you should program for a filesystem that gives good guarantees, and recommend that the user not use other filesystems (I have certainly avoided any filesystem maintained by Ted Ts'o since the O_PONIES discussion). Or, if you feel like accommodating bad filesystems, have a flag for the application that inserts an fsync of the file and its directory after every change of the file (that may satisfy Ted Ts'o for now); the user can use that flag when using a bad filesystem.

Concerning efficiency, this can be implemented efficiently by writing the changes (data blocks and/or journal entries) to free blocks in arbitrary order, then a write barrier, then a root block that allows finding all this stuff; once that root block reaches the platter, the recovery code will find everything. This does not even require any synchronous operation in principle, so it could be very efficient (in practice, I think that the barrier operations on existing drives are synchronous).
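A heavily simplified sketch of that scheme, just to show where the single ordering point sits (the device fd, block size, and layout are invented for illustration; a real filesystem would obviously track free space and much more):

    /* Conceptual sketch only: new blocks go to free space in any order,
     * one flush acts as the barrier, then the root block that makes them
     * reachable is written and flushed. */
    #include <unistd.h>

    #define BLKSZ 4096

    int commit(int devfd, const void *blocks[], const off_t where[], int n,
               const void *root, off_t root_off)
    {
        for (int i = 0; i < n; i++)     /* free blocks, arbitrary order */
            if (pwrite(devfd, blocks[i], BLKSZ, where[i]) != BLKSZ)
                return -1;
        if (fdatasync(devfd) != 0)      /* barrier: blocks before root */
            return -1;
        if (pwrite(devfd, root, BLKSZ, root_off) != BLKSZ)
            return -1;
        return fdatasync(devfd);        /* root block on stable storage */
    }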

Concerning lying hardware: if the hardware lies, all the fsyncs that Ted Ts'o asks for will not help you. Just as we need filesystems with decent guarantees, we need honest hardware. And I expect it exists, because otherwise database servers would be unable to guarantee consistency after crashes.

Fast commits for ext4

Posted Jan 16, 2021 15:54 UTC (Sat) by mss (subscriber, #138799) [Link]

> However, this change broke some existing tests that depend on fsync() working as a global barrier, so he backed it out.

It would be interesting to know how fsync() is implemented on filesystems other than ext*, and whether it guarantees committing other files, too.

Looking at the btrfs and xfs code, it seems that they only sync the current inode when handling f(data)sync() calls, so applications depending on fsync() being a filesystem-wide barrier are probably broken on these filesystems, too.

PostgreSQL might benefit from a fd-list fsync() API

Posted Jan 19, 2021 4:36 UTC (Tue) by ringerc (subscriber, #3071) [Link] (3 responses)

I think PostgreSQL might benefit significantly from an fsync() variant that takes a list of file descriptors, though the project is considering moving to syncfs() on Linux. It generally only cares about flushing many files when it is doing checkpoints, and the ordering of the individual flushes isn't important for those.

Pg has three major write/flush order requirements.

1. For WAL (its journal): WAL record writes must be flushed to disk strictly in ascending byte order within the WAL files. Not all WAL record writes must be immediately durable, but all writes prior to a durable record, like a commit, must complete before the durable record is flushed. PostgreSQL uses either O_DSYNC or fdatasync() to ensure this.

2. Ordering of non-WAL writes with respect to their corresponding WAL record writes. Non-WAL writes such as heap pages, index pages, clog (commit state) must not be flushed to disk until the corresponding WAL writes are known to be durable. To do this, PostgreSQL buffers the data to be written in its own shared memory, delaying any write() to the OS until it knows the WAL for these writes is durable. This causes significant double-buffering waste and write latency.

3. Checkpoint WAL segment removal vs non-WAL writes. PostgreSQL must know that all non-WAL writes corresponding to WAL records up to a given point in the WAL are durably on disk before it removes or overwrites a WAL segment (file). PostgreSQL does this by remembering each file descriptor that was touched since the last checkpoint and fsync()ing it before removing any WAL segments.

If this sounds inefficient to you, that's because it is.

The O_DSYNC or fdatasync() after many WAL record writes can limit the overall throughput of sync-sensitive writes like commit records, especially since sometimes a large volume of less-critical records must be flushed before the critical record is durably flushed. WAL writing spends a lot of time waiting because there's no API that lets us pipeline or queue up fsync()s in a non-blocking manner - except possibly helper threads for blocking calls, and those cause plenty of other issues.

For non-WAL (heap) writes, PostgreSQL has to tie up memory for its pending writes until it knows the OS has flushed the corresponding WAL writes. Only then can it write() them, because it has no way to tell the OS that they must not hit disk before the corresponding WAL records are durable. That means it has to buffer them for longer, and double handle the writes.

Finally, for checkpoints PostgreSQL must fsync() all the dirty files it has open. This will force all pending writes on those files to disk, not just the writes made before the current checkpoint target position. On most Linux FSes it also flushes a bunch of other irrelevant and uninteresting files that don't need to be durable at all.
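As a simplified sketch (not PostgreSQL's actual code), that checkpoint step boils down to a loop like the following; a hypothetical fd-list fsync() variant would collapse it into a single call:

    /* Every file dirtied since the last checkpoint is fsync()ed before
     * any WAL segment may be recycled. */
    #include <unistd.h>

    int flush_dirty_files(const int *dirty_fds, int n)
    {
        for (int i = 0; i < n; i++) {
            /* Each call also forces out writes made after the checkpoint
             * target position; there is no way to say "only up to here". */
            if (fsync(dirty_fds[i]) != 0)
                return -1;
        }
        return 0;    /* only now is it safe to remove old WAL segments */
    }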

What I'd really like to have is a way to tell the FS about write-order dependencies when PostgreSQL issues a write().

Postgres would request a tag for each WAL record write, and then tag each write that depends on that WAL record with an ordering requirement against that WAL write tag. Then if WAL ordering was critical for a given WAL write it'd tag the next WAL write with an ordering requirement against the previous critical WAL write. So the OS would be free to write in natural order:

* WAL record A
* Heap changes for A
* WAL record B
* Heap changes for B

or reorder:

* WAL record A
* WAL record B
* Heap changes for A and B in any combination

but could not write heap changes for A before WAL A, or heap changes for B before WAL B if postgres specified the A before B write-order requirement.

That'd let Pg just write everything to the OS without blocking on fsync() etc, and let the kernel's dirty writeback and VFS/block layer sort out the ordering.

Basically a kind of AIO with the ability to impose write ordering requirements and with a sensible interface for confirming data is durably flushed. The latter is woefully lacking in any current AIO facilities.

It'd be necessary to guarantee that any read of a file with pending writes always returned the latest data that's pending writeback. Otherwise Pg would have to pin each dirty buffer in its own buffer cache until it knew it was flushed by the OS anyway.

I hope that the recent blk-mq work may well benefit PostgreSQL if the needed balance between permitting reordering and imposing necessary write-order constraints proves to be possible.

Note that Pg can't use FS-level write-order barriers after each WAL record written. It'd eliminate the latency after each fsync(), but would be grossly inefficient for this because it'd prevent reordering of non-WAL writes across barriers, and we want the FS to be free to reorder non-WAL writes as aggressively as possible (up to the next checkpoint) in order to do write combining and write deduplication.
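To make the wished-for interface concrete, here is a purely hypothetical sketch of what a tagged write-ordering API could look like - none of these types or calls exist in any kernel; they only restate the ordering rules above in API form:

    /* Purely hypothetical interface sketch.  Each write can carry a tag,
     * and later writes can declare that they must not reach stable
     * storage before a tagged write has. */
    #include <stddef.h>
    #include <sys/types.h>

    typedef unsigned long write_tag_t;

    /* Submit a buffered write and get back a tag naming it. */
    int tagged_pwrite(int fd, const void *buf, size_t len, off_t off,
                      write_tag_t *tag_out);

    /* As above, but this write must not become durable before the write
     * identified by 'after' is durable. */
    int tagged_pwrite_after(int fd, const void *buf, size_t len, off_t off,
                            write_tag_t after, write_tag_t *tag_out);

    /* Block (or poll) until everything up to and including 'tag' is durable. */
    int wait_durable(write_tag_t tag);

    /* Usage, following the example above:
     *   tagged_pwrite(wal_fd, walA, lenA, offA, &tagA);             // WAL A
     *   tagged_pwrite_after(heap_fd, heapA, lhA, ohA, tagA, NULL);  // heap after A
     *   tagged_pwrite(wal_fd, walB, lenB, offB, &tagB);             // WAL B
     *   tagged_pwrite_after(heap_fd, heapB, lhB, ohB, tagB, NULL);  // heap after B
     * The kernel may reorder freely, subject only to the declared edges. */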

PostgreSQL might benefit from a fd-list fsync() API

Posted Jan 19, 2021 14:51 UTC (Tue) by Wol (subscriber, #4433) [Link] (1 responses)

> Finally, for checkpoints PostgreSQL must fsync() all the dirty files it has open. This will force all pending writes on those files to disk, not just the writes made before the current checkpoint target position. On most Linux FSes it also flushes a bunch of other irrelevant and uninteresting files that don't need to be durable at all.

Does PG tend to have a lot of these (dirty files)? Because Pick, the database I'm thinking of, typically stores a single "table" in one OR MORE OS files, so I could have a *lot* of dirty files open ...

Although of course, like PG, the main thing I'm concerned about is knowing that the journal is flushed ...

Cheers,
Wol

PostgreSQL might benefit from a fd-list fsync() API

Posted Jan 22, 2021 1:15 UTC (Fri) by ringerc (subscriber, #3071) [Link]

Yes. PostgreSQL may have many dirty FDs. Each table or index is stored as a separate file, split into 1GB extents. I imagine that the extent splitting could be changed if there were a benefit to doing so, but the separate files per relation not so much.

PostgreSQL might benefit from a fd-list fsync() API

Posted Jan 19, 2021 14:55 UTC (Tue) by Wol (subscriber, #4433) [Link]

> Note that Pg can't use FS-level write-order barriers after each WAL record written. It'd eliminate the latency after each fsync(), but would be grossly inefficient for this because it'd prevent reordering of non-WAL writes across barriers, and we want the FS to be free to reorder non-WAL writes as aggressively as possible (up to the next checkpoint) in order to do write combining and write deduplication.

Does PG often spend time io-bound? How important really is "as efficient as possible" disk io?

I would have thought this issue would actually impact me even more, given that traditionally Pick has always been "don't bother with caching, it's faster to retrieve it from disk" and Pick can also really hammer the disk.

Cheers,
Wol

Fast commits for ext4

Posted Jan 19, 2021 14:24 UTC (Tue) by jan.kara (subscriber, #59161) [Link] (1 responses)

> In the default data=ordered mode, where the journal entry is written only after flushing all pending data, delayed allocation might thus delay the writing of the journal.

This is actually not quite correct. Delayed allocation just means that write(2) stores data in the page cache without actually allocating blocks on disk. This also means that the journaling machinery is completely ignorant of the write at this moment. Later, when the VM decides to write out dirty pages from the page cache, the filesystem allocates blocks for the pages, and it is only at this point that there are filesystem metadata changes to be journaled. So it isn't true that "delayed allocation may delay writing of the journal".

Fast commits for ext4

Posted Feb 8, 2021 15:58 UTC (Mon) by mrybczyn (subscriber, #81776) [Link]

Thank you for the comment, Jan. We've clarified the sentence.


Copyright © 2021, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds