Improved block-layer error handling [LWN.net]

By Jonathan Corbet
June 2, 2017

The kernel's filesystem and block layers are places where a lot of things can go wrong, often with unpleasant consequences. To make things worse, when things do go wrong, informing user space about the problem can be difficult as a consequence of how block I/O works. That can result in user-space applications being unaware of trouble at the I/O level, leading to lost data and enraged users. There are now two separate (and complementary) proposals under discussion that aim to improve how error reporting is handled in the block layer.

Block-layer error codes

One problem with existing reporting mechanisms is that they are based on standard Unix error codes, but those codes were never designed to handle the wide variety of things that can go wrong with block I/O. As a result, almost any type of error ends up being reported back to the higher levels of the block layer (and user space) as EIO (I/O error) with no further detail available. That makes it hard to determine, at both the filesystem and user-space levels, what the correct response to the error should be.

Christoph Hellwig is working to change that situation by adding a dedicated set of error codes to be used within the block layer. This patch set adds a new blk_status_t type to describe block-level errors. The specific error codes added thus far correspond mostly to the existing Unix codes. So BLK_STS_TIMEOUT, indicating an operation timeout, maps to ETIMEDOUT, while BLK_STS_NEXUS, describing a problem connecting to a remote storage device, becomes EBADE ("invalid exchange"). There is, according to Hellwig, "some low hanging fruit" that can be improved by additional error codes, but those codes are not added as part of this patch set.
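As a rough illustration of the mapping described above, the following user-space sketch models a block-status type and its translation to Unix error codes. Only the two pairings named in the text (BLK_STS_TIMEOUT with ETIMEDOUT, BLK_STS_NEXUS with EBADE) come from the patch set; the enum layout, the numeric values, the sketch_ names, and the fallback to EIO are assumptions made for the example and are not the kernel's actual definitions.

```c
#include <errno.h>

/* Hypothetical stand-in for the kernel's blk_status_t; values are illustrative. */
typedef enum {
    SKETCH_BLK_STS_OK = 0,
    SKETCH_BLK_STS_TIMEOUT,   /* operation timed out */
    SKETCH_BLK_STS_NEXUS,     /* problem connecting to a remote storage device */
    SKETCH_BLK_STS_IOERR      /* anything without a more specific code */
} sketch_blk_status_t;

/* Translate a block-layer status into a negative Unix error code. */
static int sketch_blk_status_to_errno(sketch_blk_status_t status)
{
    switch (status) {
    case SKETCH_BLK_STS_OK:      return 0;
    case SKETCH_BLK_STS_TIMEOUT: return -ETIMEDOUT;
    case SKETCH_BLK_STS_NEXUS:   return -EBADE;
    default:                     return -EIO;  /* the traditional catch-all */
    }
}
```

The point of keeping a separate type is that the richer status can travel through the block layer intact and only be flattened to an errno value at the boundary where user space needs one.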

The new errors can be generated at the lowest levels of the kernel's block drivers, and will be propagated to the point that filesystem code sees them in the results of its block I/O requests. To get there, the bi_error field in struct bio, which contained a Unix error code, has been renamed to bi_status. In-tree filesystems have been changed to use the new field, but they do not yet act on the additional information that may be available there.

This is, in other words, relatively early infrastructural work that makes it possible for the block layer to produce better error information. Actually making use of that infrastructure will have to wait until this work is accepted and headed toward the mainline.

Reporting writeback errors

One particular challenge for block I/O error reporting is that many I/O requests are not the direct result of a user-space operation. Most file data is buffered through the kernel's page cache, and there can be a significant delay between when an application writes data into the cache and when a writeback operation flushes that data to persistent storage. If something goes wrong during writeback, it can be hard to report that error back to user space since the operation that caused that writeback in the first place will have long since completed. The kernel makes an attempt to save the error and report it on a subsequent system call, but it is easy for that information to be lost with the result that the application is unaware that it has lost data.

Jeff Layton's writeback-error reporting patches are an attempt to improve this situation. He adds a mechanism that is based on the idea that applications that care about their data will occasionally call fsync() to ensure that said data has made it to persistent storage. Current kernels might report a writeback error on an fsync() call, but there are a number of ways in which that can fail to happen. With the new mechanism in place, any application that holds an open file descriptor will reliably get an error return on the first fsync() call that is made after a writeback error occurs.
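From the application side, the pattern this mechanism relies on looks roughly like the sketch below: write the data, then call fsync() and treat a failure there as a sign that earlier buffered writes may not have reached storage. The helper name write_durably() and its return convention are assumptions for the example; the system calls themselves are standard.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Write a buffer to a file and confirm it reached storage.
 * Returns 0 on success or a negative errno value on failure.
 */
static int write_durably(const char *path, const void *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -errno;

    const char *p = buf;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;           /* interrupted before writing; retry */
            int err = errno;
            close(fd);
            return -err;
        }
        p += n;
        len -= (size_t)n;
    }

    /* A writeback error, if one occurred, is expected to surface here. */
    if (fsync(fd) < 0) {
        int err = errno;
        close(fd);
        return -err;
    }

    if (close(fd) < 0)
        return -errno;
    return 0;
}
```

A caller would only tell the user (or a network peer) that the data is saved once write_durably() has returned 0.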

To get there, the patch set creates a new type (errseq_t) for the reporting of writeback errors. It is a 32-bit value with two separate fields: an error code (of the standard Unix variety) and a sequence counter. That counter tracks the number of times that an error has been reported in that particular errseq_t value; kernel code can remember the counter value of the last error reported to user space. If the counter increases on a future check, a new error has been encountered.
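The article describes errseq_t only as a 32-bit value holding an error code and a sequence counter. A minimal user-space model of that idea might look like the following; the 12-bit/20-bit split, the sketch_ names, and the helper functions are assumptions for illustration, and the real patches may lay out the bits differently.

```c
#include <stdint.h>

typedef uint32_t sketch_errseq_t;

#define SKETCH_ERR_BITS 12u                             /* assumed width of the error field */
#define SKETCH_ERR_MASK ((1u << SKETCH_ERR_BITS) - 1u)  /* low bits hold the error code */
#define SKETCH_CTR_INC  (1u << SKETCH_ERR_BITS)         /* one step of the counter */

/* Record a new error: store the code and bump the counter (wraps on overflow). */
static void sketch_errseq_set(sketch_errseq_t *eseq, int err)
{
    uint32_t counter = (*eseq & ~SKETCH_ERR_MASK) + SKETCH_CTR_INC;
    *eseq = counter | ((uint32_t)err & SKETCH_ERR_MASK);
}

/* Extract the stored (positive) error code. */
static int sketch_errseq_error(sketch_errseq_t eseq)
{
    return (int)(eseq & SKETCH_ERR_MASK);
}

/* Extract the sequence counter. */
static uint32_t sketch_errseq_counter(sketch_errseq_t eseq)
{
    return eseq >> SKETCH_ERR_BITS;
}

/* Has an error been recorded since 'since' was sampled? */
static int sketch_errseq_changed(sketch_errseq_t eseq, sketch_errseq_t since)
{
    return sketch_errseq_counter(eseq) != sketch_errseq_counter(since);
}
```

Packing both fields into one word means a single load is enough to sample the current state, which is presumably part of the appeal of the design.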

The errseq_t variables are added to the address_space structure, which controls the mapping between pages in the page cache and those in persistent storage. The writeback process uses this structure to determine where dirty pages should be written to, so it is a logical place to store error information. Meanwhile, any open file descriptor referring to a given file will include a pointer to that address_space structure, so this errseq_t value is visible (within the kernel) to all processes accessing the file. Each open file (tracked by struct file) gains a new f_wb_err field to remember the sequence number of the last reported error.

Storing that value in the file structure has an important benefit: it makes it possible to report a writeback error exactly once to every process that calls fsync() on that file, regardless of when they make that call. In current kernels, only the first caller after an error occurs has a chance of seeing that error information. It would arguably be better to report the error only to the process that actually wrote the data that experienced the error, but tracking things at that level would be cumbersome and slow. By informing all processes, this mechanism ensures that the right process will get the news.
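The report-once-per-open-file behavior can be modeled in a few lines of user-space C. The structures below are deliberately simplified stand-ins for address_space and struct file; every name is hypothetical, and a plain counter is used in place of the packed errseq_t value.

```c
#include <errno.h>

/* Hypothetical, simplified stand-in for address_space. */
struct sketch_mapping {
    unsigned int wb_seq;   /* bumped each time a writeback error is recorded */
    int wb_err;            /* most recent error code (positive errno) */
};

/* Hypothetical, simplified stand-in for struct file. */
struct sketch_file {
    struct sketch_mapping *mapping;
    unsigned int f_wb_seq; /* sequence number of the last error reported here */
};

/* Called by the (modeled) writeback path when an I/O error occurs. */
static void sketch_record_wb_error(struct sketch_mapping *m, int err)
{
    m->wb_err = err;
    m->wb_seq++;
}

/* Model of fsync(): report an unseen error once, then advance this file's cursor. */
static int sketch_fsync(struct sketch_file *f)
{
    if (f->f_wb_seq == f->mapping->wb_seq)
        return 0;
    f->f_wb_seq = f->mapping->wb_seq;
    return -f->mapping->wb_err;
}
```

With two sketch_file objects sharing one mapping, a single recorded error makes sketch_fsync() return -EIO once on each of them and 0 afterward, because reporting to one file never consumes the error for the other.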

The final step is to get the low-level filesystem code to use the new reporting mechanism when something goes wrong. Rather than convert all filesystems at once, Layton chose to add a new filesystem-type flag (FS_WB_ERRSEQ) that can be set for filesystems that understand the new scheme. Code at the virtual filesystem layer can then react accordingly depending on whether the filesystem has been converted or not. The intent is to remove this flag and the associated mechanism once all in-tree filesystems have made the change.
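The opt-in arrangement amounts to a flag test at the point where the generic code must choose a reporting path. The sketch below shows only that shape; the flag's numeric value, the structure, and the function are hypothetical and do not reproduce the patch set's code.

```c
/* Hypothetical flag value; the patch set's actual definition may differ. */
#define SKETCH_FS_WB_ERRSEQ (1u << 0)

/* Hypothetical, pared-down stand-in for a filesystem-type descriptor. */
struct sketch_fs_type {
    const char *name;
    unsigned int fs_flags;
};

enum sketch_wb_path { SKETCH_WB_LEGACY, SKETCH_WB_ERRSEQ };

/* Pick the error-reporting path based on whether the filesystem was converted. */
static enum sketch_wb_path sketch_wb_reporting_path(const struct sketch_fs_type *fs)
{
    return (fs->fs_flags & SKETCH_FS_WB_ERRSEQ) ? SKETCH_WB_ERRSEQ
                                                : SKETCH_WB_LEGACY;
}
```

Once every filesystem sets the flag, the legacy branch is dead code, which is why the flag can eventually be removed.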

The ideas behind this patch set were discussed at the 2017 Linux Storage, Filesystem, and Memory-Management Summit in March; the patches themselves have been through five public revisions since then. There is a reasonable chance that they are approaching a sort of final state where they can be considered for merging in an upcoming development cycle. The result will not be perfect writeback error reporting, but it should be significantly better than what the kernel offers now.

Multiple drives

Posted Jun 2, 2017 17:49 UTC (Fri) by abatters (✭ supporter ✭, #6932) [Link] (9 responses)

What if there is a writeback error to a filesystem on /dev/sda, and an application does fsync() on a fd to a file on a filesystem on /dev/sdb? Would it get an error? I hope not.

Multiple drives

Posted Jun 2, 2017 17:53 UTC (Fri) by corbet (editor, #1) [Link]

An error will be returned only if the application calls fsync() on a file descriptor for a file that has experienced errors. Multiple drives are not an issue; errors should not propagate beyond the affected file even on a single drive.

Multiple drives

Posted Jun 2, 2017 17:54 UTC (Fri) by jlayton (subscriber, #31672) [Link] (7 responses)

No. Errors are stored on a per-inode basis (well, per address-space, but most inodes have only a single address_space). A filesystem on /dev/sda would not have the same inodes as one on /dev/sdb, so that wouldn't occur.

Multiple drives

Posted Jun 3, 2017 6:17 UTC (Sat) by pbonzini (subscriber, #60935) [Link] (6 responses)

Would it be possible to exclude O_DIRECT file descriptors from reporting writeback failures, or perhaps you are already doing that?

Multiple drives

Posted Jun 3, 2017 9:53 UTC (Sat) by jlayton (subscriber, #31672) [Link] (5 responses)

That's really not related to the changes we're making here, but it is possible to do so.

Ultimately, an fsync syscall returns whatever the filesystem's fsync operation returns, so if the filesystem wants to check for O_DIRECT and always return 0 without flushing, then it can do so today.

Now, that said...one wonders why an application would call fsync on an O_DIRECT fd?

Multiple drives

Posted Jun 4, 2017 4:02 UTC (Sun) by neilbrown (subscriber, #359) [Link] (4 responses)

> Now, that said...one wonders why an application would call fsync on an O_DIRECT fd?

To ensure that the metadata is safe? I think you need O_SYNC|O_DIRECT if you want to not use fsync at all.
See "man 2 open"

Multiple drives

Posted Jun 4, 2017 19:32 UTC (Sun) by pbonzini (subscriber, #60935) [Link] (3 responses)

Also to ensure that the data is safe, because writes can stop at the disk cache and an fsync is needed to ensure it reaches the platters or the flash. This is represented as a REQ_FLUSH request (while metadata often are REQ_FUA, i.e. force unit access). REQ_FLUSH applies to all completed writes *before* the flush, while FUA applies to the write that had the flag only.

Multiple drives

Posted Jun 5, 2017 11:44 UTC (Mon) by jlayton (subscriber, #31672) [Link] (2 responses)

Thanks, that makes sense.

I don't quite see why you'd want to avoid reporting errors on a O_DIRECT fd in either case though. In both cases, it's possible that data previously written via that O_DIRECT file descriptor didn't make it to disk, so wouldn't you want to inform the application?

The big change here is that reporting those errors on the O_DIRECT fd won't prevent someone else from seeing those errors via another fd. So, I don't quite see why it'd be desirable to avoid reporting it on the O_DIRECT one.

Multiple drives

Posted Jun 5, 2017 11:55 UTC (Mon) by pbonzini (subscriber, #60935) [Link] (1 responses)

> I don't quite see why you'd want to avoid reporting errors on a O_DIRECT fd in either case though. In both cases, it's possible that data previously written via that O_DIRECT file descriptor didn't make it to disk, so wouldn't you want to inform the application?

I certainly would. :) However, I'm worried about the application using O_DIRECT seeing errors that happened while accessing the file via another fd.

In fact, if I understand correctly, those errors could even have happened before the O_DIRECT file descriptor had even been opened, if they have never been reported to userspace.

Multiple drives

Posted Jun 5, 2017 16:15 UTC (Mon) by jlayton (subscriber, #31672) [Link]

The patchset actually initializes the errseq_t in struct file to the value of the mapping's errseq_t at open time. So, in principle you shouldn't see errors that occurred prior to your open.

How mixed buffered and direct I/O are handled is not really addressed (or changed for that matter) in this set. Yes, you will quite likely see an error on an O_DIRECT fsync, but it's quite likely that you'll see that today anyway. Most filesystems make no distinction about whether you opened the fd with O_DIRECT or not. They flush the pagecache and inode anyway just like they would with a buffered fd.

The flip side of this (and the scarier problem) is that with the current code, it's likely that that fsync on the O_DIRECT fd would end up clearing the error such that a later fsync on the buffered fd wouldn't ever see it. That problem at least should be addressed with these changes.

Improved block-layer error handling

Posted Jun 2, 2017 18:59 UTC (Fri) by jlayton (subscriber, #31672) [Link]

To be clear, while I'm focusing on block-device based filesystems now, errseq_t based error handling is applicable for any sort of filesystem. I expect that almost all of them will end up being converted to use errseq_t for tracking errors, whether block-based or not.

Improved block-layer error handling

Posted Jun 2, 2017 23:55 UTC (Fri) by jhoblitt (subscriber, #77733) [Link] (1 responses)

Will errors from dm-mapper, lvm, and/or luks float up, or will those abstraction layers essentially hide error reporting?

Improved block-layer error handling

Posted Jun 3, 2017 9:38 UTC (Sat) by jlayton (subscriber, #31672) [Link]

They should float up.

fsync is called on a file descriptor, which is ultimately an open file on some sort of filesystem. When there is an error, the filesystem is ultimately responsible for marking the mapping for the inode with an error (sometimes this is handled by common code in lower layers, like the buffer.c routines). When fsync is called, the filesystem should check for an error since we last checked via the file, report it if there was one, and advance the file's errseq_t to the current value.

Note that the way errors get recorded is not terribly different from what we do today. The big difference is in how we report errors at fsync time. Most of the changes to filesystems are in fsync here, though I am going through various parts of the kernel and trying to make sure that we're recording errors properly when they occur.

perhaps running out of inodes could be taken "more seriously"?

Posted Jun 3, 2017 0:36 UTC (Sat) by Richard_J_Neill (subscriber, #23093) [Link] (12 responses)

We recently hit a bug where the disk had plenty of free space, but couldn't create new files, making the server unusable. It turns out, we'd run out of inodes (due to a misbehaving web-app creating hundreds of 0-byte lock files per minute). It was really hard to diagnose this, because of the lack of any helpful messages. I'd have expected that, if the kernel encounters a hard error like this, it would have at least put something into dmesg or syslog (it didn't). The design philosophy seems to be that running out of inodes is more akin to a permissions error (i.e. nothing wrong with the system) than to a fatal disk error, and that, while even a trivial usb hotplug event generates lots of log traffic, an unusable root filesystem (from inode exhaustion) is deemed not important enough to merit a log message!

perhaps running out of inodes could be taken "more seriously"?

Posted Jun 3, 2017 4:18 UTC (Sat) by k8to (guest, #15413) [Link]

The applications are all told ENOSPC in this situation, so lots of them should be complaining and some of that should be hitting logs.

It's unclear to me that the kernel should also log for each such failure. It might be so noisy as to cause more breakage. I would want the system to do something like log when this situation is near-occurring and when it has occurred in some throttled way, which suggests monitoring logic. Should that be implemented in-kernel or in userland?

perhaps running out of inodes could be taken "more seriously"?

Posted Jun 3, 2017 7:44 UTC (Sat) by MarcB (subscriber, #101804) [Link] (6 responses)

I don't think running out of free inodes is conceptually different from running out of free space. Also, it is not a problem per se from the kernel's PoV: It cannot be prevented, it does not happen at random, it is not something exceptional at all.
It is just another resource exhaustion that user space has to deal with - and perhaps even is dealing with, so nothing is actually wrong.

Also, this used to be much more common in the past, when many filesystems allowed far fewer inodes by default. So, perhaps some administrators have simply forgotten (or never learned) that inode exhaustion is a real thing.

And diagnosing this - once you are aware that it can happen - is not harder than diagnosing "out of space" (in practice: even easier, as it is unlikely that large numbers of inodes are held by deleted yet opened files).
It can, and should, also be monitored just like free disk space.

perhaps running out of inodes could be taken "more seriously"?

Posted Jun 3, 2017 23:16 UTC (Sat) by Richard_J_Neill (subscriber, #23093) [Link] (5 responses)

Yes, you're right, excepting that the common "no space left on device" message is actually very misleading when there is in fact plenty of space.

Also, while the sysadmin can add extra monitoring and debugging, surely the point of a reliable system is to minimise the chance of human error.
We are used to the abstraction of storage being "somewhere you can fill up with data"; the very existence of inodes should be no more the concern of the average programmer/sysadmin than the specifics of which pointer has which address... it should be "the computer's" problem, not "the operator's problem". If the computer is going to break that rule, and do so rarely, but catastrophically, the least it can do is to fail "noisily".

Anyway... in these days of LVM and resizeable volumes, why shouldn't the filesystem be able to automatically notice that it has lots of space but too few inodes, and automatically create more inodes as needed?

perhaps running out of inodes could be taken "more seriously"?

Posted Jun 4, 2017 1:39 UTC (Sun) by rossmohax (guest, #71829) [Link] (1 responses)

That is exactly what XFS is doing: inodes are allocated dynamically, and you can never run out of them as long as you have free space. Try using XFS instead of ext4; it is awesome.

perhaps running out of inodes could be taken "more seriously"?

Posted Jun 4, 2017 5:09 UTC (Sun) by matthias (subscriber, #94967) [Link]

Even XFS can run out of inodes. Inode numbers are mapped to blocks by a very simple mapping. It is roughly like every i-th block can have inodes. The possible inodes are just numbered starting by one. Once these blocks are filled (with inodes or data), XFS cannot create new inodes. Changing the mapping would change every single inode number.

We once had the following problem after growing a filesystem. The standard at that time was to use only 32-bit inode numbers. After growing the filesystem, the 32-bit inode numbers were all in the already filled lower part of the filesystem (*). Thus no new inodes could be created. It took a while to find that one, having only the meaningful message "No space left on device." Luckily it was a 64-bit system. Thus, we could just switch to 64-bit inode numbers. The other solution would have been to recreate the filesystem, not the quickest solution with a 56 TB filesystem.

That said the circumstances under which XFS runs out of inodes are very rare. So it would be very important to have meaningful error messages, to notice that one of these rare circumstances just happened.

(*) On fs creation XFS usually chooses the number i to be such that all possible inodes have 32-bit numbers. After growing this condition was not satisfied any more, as this number cannot be changed. On 32-bit systems, one would need to set this number i manually at fs creation time, if one wants to have the possibility to grow the filesystem.

perhaps running out of inodes could be taken "more seriously"?

Posted Jun 4, 2017 14:15 UTC (Sun) by MarcB (subscriber, #101804) [Link] (2 responses)

Remember that the possible error codes for syscalls were defined by POSIX, so simply adding an EOUTOFINODES would be non-compliant and could easily do more harm than good, because in practice, ENOSPC is a good fit for "out of inodes" and software might actually expect it to cover both cases:

If the software is some kind of cache, discarding the files that are least relevant is a proper course of action for both kinds of ENOSPC.
If the software is some kind of archival system, moving the oldest files to the next tier of storage will also help in both cases.

If the software can't freely discard or move data, all it can do, is scream for help, anyway.

Also, an ENOSPC due to lack of inodes will usually happen on open() while an ENOSPC due to lack of disk space will usually happen on write() or similar.
So applications could already translate this to proper error messages. It is common that the same error code has different meaning for different syscalls and developers should know this.
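The translation suggested here could be as small as the following sketch, which refines the message for ENOSPC according to which call failed. The enum, the function name, and the message wording are all assumptions for the example, and the open()/write() split is only the "usual" case described above, not a guarantee.

```c
#include <errno.h>
#include <string.h>

/* Which system call produced the error (hypothetical, minimal set). */
enum sketch_op { SKETCH_OP_OPEN, SKETCH_OP_WRITE };

/* Return a human-readable hint for an error, refined by which call failed. */
static const char *sketch_describe_error(enum sketch_op op, int err)
{
    if (err == ENOSPC) {
        if (op == SKETCH_OP_OPEN)
            return "cannot create file: filesystem may be out of inodes";
        return "cannot write data: filesystem may be out of free blocks";
    }
    /* Everything else gets the standard system message. */
    return strerror(err);
}
```

An application would call this with the saved errno right after the failing call, for example sketch_describe_error(SKETCH_OP_OPEN, errno) when open() with O_CREAT fails.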

Of course, ideally filesystems would solve this problem completely. In fact, some do: btrfs has an upper limit of 2^64 inodes, as does XFS or ZFS (might be 2^48).
btrfs is fully dynamic, i.e. each btrfs filesystem that is large enough to hold the inode information can in fact contain 2^64 inodes. XFS is dynamic enough in practice (make sure to use "inode64", though. Otherwise inodes can only be stored in the lowest 1 TB, and that space can run out if also used for file data - been there, done that). Even NTFS allows 2^32 and is also fully dynamic.

The ext-family is the big exception. Theoretically, the limit is also 2^32, but it cannot allocate space for inodes dynamically, and thus uses much lower limits by default. Otherwise, each inode would consume 256 bytes, even if unused.

perhaps running out of inodes could be taken "more seriously"?

Posted Jun 5, 2017 11:55 UTC (Mon) by nix (subscriber, #2304) [Link] (1 responses)

> Remember that the possible error codes for syscalls were defined by POSIX, so simply adding an EOUTOFINODES would be non-compliant and could easily do more harm than good, because in practice, ENOSPC is a good fit for "out of inodes" and software might actually expect it to cover both cases

It might well do more harm than good, but the first part of your statement is just wrong. POSIX.1 2008 states (and all previous versions have similar wording):

> Implementations may support additional errors not included in this list, may generate errors included in this list under circumstances other than those described here, or may contain extensions or limitations that prevent some errors from occurring.
> The ERRORS section on each reference page specifies which error conditions shall be detected by all implementations ("shall fail") and which may be optionally detected by an implementation ("may fail"). If no error condition is detected, the action requested shall be successful. If an error condition is detected, the action requested may have been partially performed, unless otherwise stated.
> Implementations may generate error numbers listed here under circumstances other than those described, if and only if all those error conditions can always be treated identically to the error conditions as described in this volume of POSIX.1-2008. Implementations shall not generate a different error number from one required by this volume of POSIX.1-2008 for an error condition described in this volume of POSIX.1-2008, but may generate additional errors unless explicitly disallowed for a particular function.

So adding more errors is not only not noncompliant, it is both explicitly permitted and very common.

perhaps running out of inodes could be taken "more seriously"?

Posted Jun 5, 2017 16:15 UTC (Mon) by nybble41 (subscriber, #55106) [Link]

> So adding more errors is not only not noncompliant, it is both explicitly permitted and very common.

Yes, for *new* error conditions not specified by POSIX. However:

> Implementations shall not generate a different error number from one required by this volume of POSIX.1-2008 for an error condition described in this volume of POSIX.1-2008, ...

The error list for the open() and openat() system calls specifies ENOSPC as follows:

> [ENOSPC]
> The directory or file system that would contain the new file cannot be expanded, the file does not exist, and O_CREAT is specified.

So if "the filesystem ... cannot be expanded" is read to include the "out of inodes" condition (a reasonable interpretation IMHO) then POSIX requires open() to return ENOSPC for this condition, and not some other error code.

perhaps running out of inodes could be taken "more seriously"?

Posted Jun 3, 2017 8:31 UTC (Sat) by matthias (subscriber, #94967) [Link] (2 responses)

Like the other commenters, I agree that running out of inodes should not be a problem of the kernel. However, the error reporting could be improved. Returning ENOSPC when the actual problem is running out of inodes is misleading. The user has to know that the error number is also used for other reasons than "No space left on device". Today, probably many users do not even know that they can run out of inodes. Even if they know this in theory, they have to remember this when seeing ENOSPC.

I would much prefer error reporting by exceptions. The type of the exception more or less corresponds to the error numbers and can be used by the program to determine how to react, but there is a string attached that can be passed up the call chain, which has meaningful information for the user. This way the program still gets the information contained in ENOSPC (actually most programs are fine to react to running out of space and running out of inodes in the same way), but the user which sees the error message knows instantly where to search for the problem.

Adding type inheritance to the exceptions additionally allows the program to select how fine grained the error information should be. Some programs are fine seeing an IO exception. Others want to differentiate whether the error is running out of resources or a real problem and some might want to know the difference between running out of space and running out of inodes.

perhaps running out of inodes could be taken "more seriously"?

Posted Jun 4, 2017 1:42 UTC (Sun) by rossmohax (guest, #71829) [Link]

you don't need exceptions to have error inheritance.

perhaps running out of inodes could be taken "more seriously"?

Posted Jun 4, 2017 3:39 UTC (Sun) by Cyberax (✭ supporter ✭, #52523) [Link]

Exceptions imply some kind of a type system. I'd settle for something like: "error.filesystem.io.disk-space/required=1233/available=123" where I can use simple prefix matching to get more and more detailed error.
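The prefix-matching idea needs very little code. In the sketch below, the error string format is the one from the comment above; the function name and the rule that a match must end on a '.' or '/' boundary are assumptions added so that a partial component such as "error.file" does not accidentally match.

```c
#include <stdbool.h>
#include <string.h>

/*
 * True if 'error' falls under the category named by 'prefix'. The match must
 * end at a component boundary ('.', '/', or end of string).
 */
static bool sketch_error_matches(const char *error, const char *prefix)
{
    size_t n = strlen(prefix);
    if (strncmp(error, prefix, n) != 0)
        return false;
    return error[n] == '\0' || error[n] == '.' || error[n] == '/';
}
```

A program that only cares about I/O trouble in general can test against "error.filesystem.io", while one that wants to react specifically to a full disk can test against the longer "error.filesystem.io.disk-space".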

perhaps running out of inodes could be taken "more seriously"?

Posted Jun 3, 2017 10:32 UTC (Sat) by itvirta (guest, #49997) [Link]

Now that you learned about the issue of inodes running out, you know to add it to your monitoring.
It's very much the same as running out of disk space, which isn't that uncommon with some logging
getting out of hand either. Both can be checked with `df`.
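For monitoring from a program rather than from the shell, the standard statvfs() call exposes the same block and inode counts. The following is a minimal sketch; the enum, the function name, and the percentage threshold are assumptions for the example.

```c
#include <sys/statvfs.h>

enum sketch_fs_pressure {
    SKETCH_FS_OK,
    SKETCH_FS_LOW_BLOCKS,
    SKETCH_FS_LOW_INODES,
    SKETCH_FS_UNKNOWN      /* statvfs() failed */
};

/*
 * Classify a mounted filesystem by whichever resource is scarce.
 * min_free_pct is the smallest acceptable percentage of free blocks or inodes.
 */
static enum sketch_fs_pressure sketch_check_fs(const char *path,
                                               unsigned int min_free_pct)
{
    struct statvfs sv;
    if (statvfs(path, &sv) != 0)
        return SKETCH_FS_UNKNOWN;

    /* Filesystems that allocate inodes dynamically may report f_files == 0. */
    if (sv.f_files > 0 &&
        (unsigned long long)sv.f_favail * 100 <
        (unsigned long long)sv.f_files * min_free_pct)
        return SKETCH_FS_LOW_INODES;

    if (sv.f_blocks > 0 &&
        (unsigned long long)sv.f_bavail * 100 <
        (unsigned long long)sv.f_blocks * min_free_pct)
        return SKETCH_FS_LOW_BLOCKS;

    return SKETCH_FS_OK;
}
```

A monitoring job could call sketch_check_fs("/", 10) periodically and raise an alert on anything other than SKETCH_FS_OK, which would have caught the lock-file case before the filesystem became unusable.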

Also, there's the possibility of distributing unrelated data on separate file systems, or using quotas to
protect the rest of the system from an application getting out of hand.

Improved block-layer error handling

Posted Jun 3, 2017 12:27 UTC (Sat) by nix (subscriber, #2304) [Link] (12 responses)

> He adds a mechanism that is based on the idea that applications that care about their data will occasionally call fsync() to ensure that said data has made it to persistent storage.

I keep hearing this, but the problem is that not only is this not true, you don't want it to be true and would probably refuse to use any system on which it was true because its performance would be appalling. Obviously, yes, text editors should be (and are) very careful about fsync()ing your six hours of work now you finally remembered to save it -- but let's pick on another favourite test load of kernel hackers, compiling a kernel. It would be bad if a chunk of data was omitted from the middle of an object file, right? So clearly the assembler "cares about" its data in this sense. But, equally, an assembler that called fsync() on its output would be the subject of copious vile swearing: you don't want your massive 64-way compile to be fsync()ing all over the place, not even in a filesystem better-behaved than ext3 (where fsync sometimes == sync()). You want any sync to happen at the end, after everything is linked, and you're probably happy if nothing syncs at all much of the time (for test compiles, if the power goes out, you'll just rebuild). However, that doesn't mean you're happy if an I/O error replaces crucial hunks of the kernel with \0!

This is just the first example that springs to mind. There are probably many more. One thing that's become clear to me as I classify everything on my machines into 'I care about this, RAID-6 and bcache it' and 'I don't care about this, chuck it on an unjournalled RAID-0' is that not only is there currently no way for applications to indicate what is important in this sense, but there is also *no way for most of them to know at all*. Whether a given file write is important is a property of what the user plans to do with the file later.

(Another kernel-compile-variety case: I do a lot of quick checks of enterprise kernels, with all their modules. Each module_install writes about 3-4GiB of module data out to /lib/modules/$whatever/kernel. Obviously that's an important write, right? If it goes wrong the machine probably won't boot! Only it's not: 90% of those modules are never referenced again, and the whole lot is going onto a loopback filesystem on that RAID-0 array because I'm actually only going to use it once, for testing, then throw it away. There is no way the assembler, the linker, install(1), or the kernel makefile could know that, but if it didn't know that it might e.g. in my case decide to cache all 3GiB on an SSD, or journal it all through the RAID journal, or fsync() each file individually, or something. And, of course, in most cases even the users don't bother to make this sort of determination, or don't have the knowledge to, even though they're the only ones who could.)

I do not see an easy way out of this. :(

Improved block-layer error handling

Posted Jun 3, 2017 13:57 UTC (Sat) by gdt (subscriber, #6284) [Link] (3 responses)

The rules for "correct" use of fsync() by applications' programmers are already not useful. If the program wasn't started interactively then it's best not to call fsync(), as a few thousand fsync() calls in a short time leads to substantial jitter. How you can tell if a program is being run interactively is no longer straightforward (is that HTTP POST from a person or an API). So there is a risk v benefit balance in programmers' minds when using fsync() for common file I/O, with a strong tendency towards "no" -- partly because of advocacy from kernel programmers, but also because fsync() historically works less well than suggested by the man page (eg, not clear to me if wear-leveling SSDs work for the case where fsync() is immediately followed by power-loss).

It's worse for library authors, as they have no idea of the significance of the data, and so no way of knowing whether to implement the notion that "applications that care about their data will occasionally call fsync()". You might argue that databases should use fsync(). You'll recall that Firefox had an issue with adding unwelcome latency by storing bookmarks in a SQLite database which issued fsync().

For these reasons even if they "care" for the data a programmer might well choose not to call fsync() but simply close() the file and let the system proceed without added latency. On the plus side, applications' programmers already accept some asynchronicity in read(), write() and close() error reporting and perhaps this could be further extended.

Improved block-layer error handling

Posted Jun 3, 2017 14:38 UTC (Sat) by nix (subscriber, #2304) [Link] (2 responses)

> not clear to me if wear-leveling SSDs work for the case where fsync() is immediately followed by power-loss

I believe that the only SSDs that currently even try to reliably handle power loss without at least the possibility of massive data loss, corruption, or outright device failure are Intel's fairly costly datacentre parts. So, 'no'. :(

Improved block-layer error handling

Posted Jun 3, 2017 15:23 UTC (Sat) by jhoblitt (subscriber, #77733) [Link] (1 responses)

If power-loss is a failure mode of significant concern for a non-distributed system, typically a "RAID controller" (may be in JBOD mode) with a BBU is used. That seems like a pretty reasonable engineering compromise as long as we don't have large quantities of non-volatile memory. If we had massive amounts of high-speed NVM, we probably wouldn't even need to worry about fsync()ing at all.

Improved block-layer error handling

Posted Jun 3, 2017 22:23 UTC (Sat) by nix (subscriber, #2304) [Link]

Well, the clear intention of journalled md is that SSDs with decent powerfail behaviour be used (good thing one such does exist: it even tells you in the SMART data if its capacitors are failing). It's also frankly damn stupid that any storage devices exist that can brick themselves on not exactly rare events (even places with excellent grids have a power failure or two a decade).

Improved block-layer error handling

Posted Jun 4, 2017 4:26 UTC (Sun) by neilbrown (subscriber, #359) [Link] (2 responses)

I think you are conflating two distinct but similar concepts - safety and integrity.

On the one hand you have applications that need to know that the data they have written is "safe". They need to know this so that they can tell someone. Maybe the editor tells the user "the file has been saved". Maybe the email system tells its peer "I have that email now, you can discard your copy". Maybe the database store is telling the database journal "that information is safe".
In each of these cases you need fsync() because you need to tell someone that the data is safe.
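
As a rough illustration of the "safe" case, here is a minimal Python sketch of an editor-style save that only reports success after fsync() has returned. The function name save_durably() is made up for this example, and the write-to-temporary-file-then-rename pattern is one common approach rather than anything prescribed in this thread. It assumes a Linux-like system where a directory can be opened and fsync()ed.

```python
import os
import tempfile


def save_durably(path: str, data: bytes) -> None:
    """Write data to path; return only after fsync() has succeeded.

    Hypothetical helper for illustration. Any OSError (for example EIO
    from a failed writeback) propagates to the caller, who must then NOT
    tell anyone that the data is safe.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()              # push Python's buffer to the kernel
            os.fsync(f.fileno())   # ask the kernel to push it to the device
        os.replace(tmp_path, path)  # atomically swap the new file into place
    except BaseException:
        # Clean up the temporary file if it is still there, then re-raise.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
    # fsync the containing directory so the rename itself is persistent.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Only when save_durably() returns without raising would the editor print "the file has been saved". Whether the device then really keeps the data across a power loss is the hardware question discussed earlier in the thread.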

The C compiler or assembler doesn't need to tell anyone. But the linker does, as you say, want to know whether the file it is loading is the same as the file that the assembler wrote out. It doesn't care if the data was safe or not. It is perfectly acceptable for the linker to say "sorry, data was corrupt" (as long as it doesn't do it too often). What is not so acceptable is for the linker to provide a complete binary which contains corruption.

In the first case you want data safety - I know I can read it back if I want to. In the second you want data integrity - I know that this data is (or isn't) the data that was written.

I don't believe the OS has any role in providing integrity, beyond best-effort to save and return data as faithfully as possible. If an application really cares, the application needs to add a checksum or crypto-hash or whatever. git does this. backups do this. gzip does this. I'm sure that if the cost/benefit analysis suggested that the C compiler should do this, then it would be easy enough to add.
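
A minimal sketch of what "add a checksum" could look like at the application level, in Python. The file layout (a SHA-256 digest followed by the payload) and the names write_with_checksum(), read_with_checksum() and IntegrityError are assumptions made for this example; they are not how git, gzip or any compiler actually stores its data.

```python
import hashlib

DIGEST_SIZE = hashlib.sha256().digest_size  # 32 bytes for SHA-256


class IntegrityError(Exception):
    """Raised when stored data does not match its stored checksum."""


def write_with_checksum(path: str, payload: bytes) -> None:
    # Hypothetical layout: 32-byte SHA-256 digest, then the payload.
    digest = hashlib.sha256(payload).digest()
    with open(path, "wb") as f:
        f.write(digest + payload)


def read_with_checksum(path: str) -> bytes:
    with open(path, "rb") as f:
        blob = f.read()
    digest, payload = blob[:DIGEST_SIZE], blob[DIGEST_SIZE:]
    # A truncated file or a mismatching digest both count as corruption.
    if len(digest) != DIGEST_SIZE or hashlib.sha256(payload).digest() != digest:
        raise IntegrityError(f"{path}: data does not match its checksum")
    return payload
```

Note that nothing here calls fsync(): the reader gets integrity ("this is, or is not, what was written") without the writer ever having been promised safety.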

Improved block-layer error handling

Posted Jun 5, 2017 12:05 UTC (Mon) by nix (subscriber, #2304) [Link] (1 responses)

Aha. Your distinction makes sense: I was indeed conflating these, and fsync() does indeed provide safety, not integrity. Filesystems *are* increasingly providing integrity support, because disk vendors are not exactly brilliant at providing it (how many vendors seriously try not to wreck their SSDs' contents on power failure: only Intel? and even they don't on all parts).

Of course, POSIX provides no way for applications to say 'hey, fs, I want integrity from this, thank you', and have it do whatever checksumming it can so the applications don't all need to reimplement it. This might make sense: it seems like something that could probably be a per-filesystem attribute, or at least a whole-directory-tree attribute or something. Of course, POSIX also provides no way to say 'hey, fs, this file was written but failed integrity checks': -EIO is, ah, likely to be misinterpreted by essentially everything. So while it would be nice to have app-level integrity checking, I doubt we can get there from here: we do need to do it invisibly, below the visible surface of the system.

Improved block-layer error handling

Posted Jun 5, 2017 18:51 UTC (Mon) by zblaxell (subscriber, #26385) [Link]

> POSIX provides no way for applications to say 'hey, fs, I want integrity from this, thank you'

Nor does it need one. POSIX should assume integrity by default unless applications say the opposite. One way applications can do that is by not checking any system call return values.

> POSIX also provides no way to say 'hey, fs, this file was written but failed integrity checks'

I don't think any changes to POSIX are required. We already have most of this in existing filesystems, just not in most existing filesystems.

In cases like compiles, where the writing application has completely disappeared before the block writes even start, there's no process to notify about the failure at the time the failure is detected. fsync() return behavior is irrelevant to this case--*every* system call, even _exit, returns before *any* error information is available. We want compiles to be fast, so we don't want to change this. A different solution is required. Note that reporting errors through fsync() is not wrong--it's just not applicable to this case.

For compiles we want to get the block-level error information passed from one producing process to another consuming process when the processes communicate through a filesystem. So let's do exactly that: If a block write fails, the filesystem should update its metadata to say "these data blocks were not written successfully and contain garbage now." Future reads of affected logical offsets of affected inodes should return EIO until the data is replaced by new (successful) writes, or the affected blocks are removed from the file by truncate, or the file is entirely deleted. If the filesystem metadata update fails too, move up the hierarchy (block > inode > subvol > planet > constellation > whatever) until the entire FS is set readonly and/or marked with errors for the next fsck to clean up by brute force.

Note that this scheme is different from block checksums. The behavior is similar, but block checksums are used to detect read errors (successful write followed by provably incorrect read), not write errors (where the write itself fails and the disk contents are unknown, possibly correct, possibly incorrect with hash collision). Checksums would not be an appropriate way to implement this. The existing prealloc mechanism in many filesystems could be extended to return EIO instead of zero data on reads. Prealloc already has most of the desired behavior wrt block allocation and overwrites.
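
The per-block behaviour described above can be modelled in a few lines of Python. This is a toy, in-memory sketch only: the class ToyFile, its methods, and the device_ok flag used to simulate a failed write are all invented for illustration and do not correspond to any real filesystem's code. It also leaves out the escalation step (what happens when the metadata update itself fails).

```python
import errno
import os

BLOCK_SIZE = 4096


class ToyFile:
    """In-memory model of one inode with per-block 'write failed' marks."""

    def __init__(self):
        self.blocks = {}     # block index -> bytes of length BLOCK_SIZE
        self.failed = set()  # indexes of blocks whose last write failed

    def write_block(self, index: int, data: bytes, device_ok: bool = True) -> None:
        if len(data) != BLOCK_SIZE:
            raise ValueError("data must be exactly one block")
        if device_ok:
            # A successful write replaces the data and clears any error mark.
            self.blocks[index] = data
            self.failed.discard(index)
        else:
            # Failed write: contents are unknown, so record that in metadata.
            self.blocks.pop(index, None)
            self.failed.add(index)

    def read_block(self, index: int) -> bytes:
        if index in self.failed:
            # Readers get EIO rather than whatever garbage is on the disk.
            raise OSError(errno.EIO, os.strerror(errno.EIO))
        # Unwritten blocks read back as zeros, like a hole in a sparse file.
        return self.blocks.get(index, bytes(BLOCK_SIZE))

    def truncate(self, nblocks: int) -> None:
        # Dropping blocks from the file also drops their error marks.
        self.blocks = {i: b for i, b in self.blocks.items() if i < nblocks}
        self.failed = {i for i in self.failed if i < nblocks}
```

In this model a consumer such as a linker that reads block 1 after a failed write sees EIO, while one that never touches block 1 sees nothing unusual, which matches the side note about linkers further down.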

> EIO is, ah, likely to be misinterpreted by essentially everything

I'm not sure how EIO could be misinterpreted in this context. The application is asking for data, and the filesystem is saying "you can't have that data because an IO-related failure occurred," so what part of EIO is misinterpreted exactly? What application (other than a general disk-health monitoring application, which could get detailed block error semantics through a dedicated kernel notification mechanism instead) would care about lower-level details, and which details would it use?

Also note EIO already happens in most filesystems, so we're not talking theoretically here. Most applications (even linkers), if they notice errors at all (**), notice EIO and do something sensible when they see it (*). This produces much, much more predictable results than just throwing random disk garbage into applications and betting they'll notice.

(*) interesting side note: linkers don't read all of the data in their input files, and will happily ignore EIO if it only occurs in the areas of files they don't read. Maybe not the best example case for a "data integrity" discussion. ;)

(**) for many years, GCC's as and ld didn't even notice ENOSPC, and would silently produce garbage binaries when the disk was full (maybe these would be detected by the linker later on...maybe not). Arguably we should also mark inodes with a persistent error bit if there is an ENOSPC while writing to them, but that *is* a major change which will surprise ordinary POSIX applications.

Improved block-layer error handling

Posted Aug 5, 2020 15:39 UTC (Wed) by pskocik (guest, #130865) [Link] (4 responses)

Off the top of my head, I think an easy (?) hack that wouldn't require modifications to each application that might contribute to a possibly quite complex project build could be to add a mechanism to the kernel (a syscall or an open on a special device) whereby a parent process of the project build could request to be signal-notified, whenever the IO failure happens, if there is a write error in one of the IO writes that its children (recursively) have issued. Presumably the parent process could then cancel the build and remove all build products in order to prevent a corrupted build. Another syscall (or perhaps an fsync on the fd from the special device) could be used by the build supervisor process when all its children have finished (with no signal generated) to wait on any IO requests generated by its children recursively.

Improved block-layer error handling

Posted Aug 7, 2020 11:14 UTC (Fri) by flussence (guest, #85566) [Link] (2 responses)

This is an old subject but here's a proposal, which I promise I spent more than 3 minutes thinking up: an fsync cgroup controller. Add a per-cgroup setting that can suppress in-situ sync attempts, like libeatmydata does, and/or pause all timer-based disk writeback unless memory pressure dictates, like laptop_mode but not global.

When the last process in the group exits, it syncs any remaining dirty buffers touched by the process tree - in this example they'd be build artifacts, but they could be overly-paranoid software that fsyncs far too much (apt-get used to, foldingathome is awful on rotational disks), or just data that's low value to begin with (downloaded Docker containers? nosql databases?)

And once we have that in place and people using it, extending it to use filesystem-native transactions (wherever they exist) seems like an obvious next move. :-)

Improved block-layer error handling

Posted Aug 7, 2020 20:02 UTC (Fri) by Wol (subscriber, #4433) [Link] (1 responses)

> or just data that's low value to begin with (downloaded Docker containers? nosql databases?)

Why are nosql databases low value? Actually, nosql databases usually have a far higher signal/noise ratio - I converted a database from nosql to sql, and I think the size of the db DOUBLED.

Not only do nosql databases contain much more data per megabyte than relational, but they tend to be much faster - it's an old story but I remember stories about a company converting from UniVerse to (sn)Oracle, and it took SIX MONTHS for the consultants to get a Snoracle query (running on a twin Xeon) to outperform the old system running on a Pentium 90.

Or the "request for bids" put out by some University Astronomy department, that wanted a system to process 100K tpm. snOracle had to "cheat" to meet the target - delayed indexing, a couple of other tricks - while Cache had no trouble hitting 250K tpm.

RDBMSs are fundamentally inefficient, due to limitations in the relational model itself ... (just try and *store* a list in an rdbms).

Cheers,
Wol

Improved block-layer error handling

Posted Aug 13, 2020 11:08 UTC (Thu) by flussence (guest, #85566) [Link]

I think the clue there is in the name “Cache”, isn't it?

Improved block-layer error handling

Posted Sep 14, 2020 17:20 UTC (Mon) by nix (subscriber, #2304) [Link]

Yeah, that works for some cases -- but the other thing fsync does, other than make sure all the fs I/O errors are in place, is ensure that the thing is entirely on cold storage in case of power failure. An -EIO handler only deals with one of those problems. (In practice, you'd probably want the thing containing the -EIO handler to *also* do an fsync itself, and for the -EIO handling machinery to suppress fsyncs in its children, or something like that, so unmodified children could be run.)