Answering email about error handling in concurrent code

Someone emailed me today asking:

I’m writing because I’m somewhat conscious of what I would consider a rather large hole in the parallel programming literature.

… What if one or more of your tasks throws an exception? Should the thread that runs the task swallow it? Should the caught exceptions get stashed somewhere so that the "parent" thread can deal with them once the tasks are complete? (This is somewhat tricky currently in a language such as C++(98) where one cannot store an exception caught with the "catch(...)" construct). Perhaps all tasks should have a no-throw guarantee? Perhaps some kind of asynchronous error handlers might be installed, somewhat like POSIX signals? The options are many, but choosing a strategy is hard for those of us with little parallel programming experience.

I thought I’d share my response here:

That’s an excellent question. Someone asked that very question in Stockholm last month at my Effective Concurrency course, and my answer started out somewhat dismissive: "Well, it’s about the same as you do in sequential code, and all the same guarantees apply; nothrow/nofail is only for a few key functions used for commit/rollback operations, and you’d usually target the basic guarantee unless adding the strong guarantee comes along naturally for near-free. So it’s pretty much the same as always. Although, well, of course futures may transport exceptions across threads, but that’s still the same because they manifest on .get(). And of course for parallel loops you may get multiple concurrent exceptions from multiple concurrent loop bodies that get aggregated into a single exception; and then there’s the question of whether you start new loop bodies that haven’t started yet (usually no) but do you interrupt loop bodies that are in progress (probably not), and… oh, hmm, yeah, I guess it would be good to write an article about that."
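The point about futures can be illustrated with a minimal sketch. It assumes Python's standard concurrent.futures module as a stand-in, where result() plays the role of .get(); parse_port and run_in_worker are hypothetical names invented for this example. The task throws on a worker thread, but the exception only manifests where the caller asks for the result, so the handling code looks like ordinary sequential code.

```python
from concurrent.futures import ThreadPoolExecutor


def parse_port(text: str) -> int:
    """Hypothetical task: raises ValueError on bad input, like any sequential function."""
    port = int(text)  # may raise ValueError for non-numeric text
    if not 0 < port < 65536:
        raise ValueError(f"port out of range: {port}")
    return port


def run_in_worker(text: str) -> str:
    """Run parse_port on a worker thread and handle failure on the calling thread."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Any exception raised by the task is stored inside the future...
        future = pool.submit(parse_port, text)
        try:
            # ...and re-raised here, on the caller's thread, when the result is requested.
            return f"ok: {future.result()}"
        except ValueError as err:
            return f"failed: {err}"
```

Calling run_in_worker("8080") takes the success path, while run_in_worker("99999") takes the except branch even though the ValueError was raised on a different thread.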

So the above is now added to my notes of things to write about. :-) Maybe some of that stream of consciousness will be helpful until I can get to writing it up in more detail.
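The parallel-loop case described above might look roughly like the following sketch. It is one possible policy, not a definitive design: parallel_for and AggregateError are hypothetical names, and the code assumes Python's concurrent.futures thread pool. After the first failure, loop bodies that have not started yet are skipped, bodies already in progress are left to finish, and every exception that did occur is gathered into a single aggregate exception.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class AggregateError(Exception):
    """Hypothetical container for every exception raised by the loop bodies."""

    def __init__(self, errors):
        super().__init__(f"{len(errors)} loop bodies failed")
        self.errors = list(errors)


def parallel_for(items, body, max_workers=4):
    """Apply body to each item on a thread pool (a sketch, not a library API)."""
    failed = threading.Event()

    def guarded(item):
        if failed.is_set():
            return  # a body failed earlier: do not start bodies that have not begun
        try:
            body(item)
        except Exception:
            failed.set()  # tell bodies that have not started yet to skip
            raise  # the future stores this exception for the caller

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(guarded, item) for item in items]
    # Leaving the with-block waits for all submitted bodies; running ones are
    # never interrupted, they simply finish.

    errors = [f.exception() for f in futures if f.exception() is not None]
    if errors:
        raise AggregateError(errors)  # one exception carrying all failures
```

The caller catches AggregateError once and inspects its errors list, which holds the failures in submission order.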

I pointed him to Doug Lea’s Concurrent Programming in Java pages 161-176, "Dealing with Failure", adding that I haven’t read it in detail but the subtopics look right. Also Joe Duffy’s Concurrent Programming on Windows, pages 721-733.

If you know of a good standalone treatise focused on error handling in concurrent code, please mention it in the comments.

9 thoughts on “Answering email about error handling in concurrent code”

  1. The same issue already arises if using asynchronous operations via the Command pattern (which doesn’t require multithreading if you’re using an event-based framework like glib or Qt). What I did in one project was to pass the error objects as part of the Command’s reply.

    About catch(...): With an exception caught via ... one can’t do much anyway, whether in the same thread or not, so my idea would be to just create some UnknownException object and pass that over to the other thread(s).

  2. Just a quick note to say thank you to Herb for posting this and all of you that have responded.

    It was my email mentioned in the post. I now have a lot more to go on, thanks to you all!

  3. Mr. Sutter/Herb: The issue you explain in your “Interrupt politely” article is exactly the one addressed in my paper: the property that I call dependency safety. Dependency safety means that if an operation B depends for its safety and correctness on the successful completion of an operation A, and A fails then B is not executed. Async exceptions generally violate dependency safety. However, my failboxes proposal addresses this, as follows: if an operation B depends on an operation A, then A and B should be executed in the same failbox. That way, if A fails, the failbox is marked as failed. The language extension enforces the property that once a failbox is marked as failed, no code executes in that failbox afterwards. So B doesn’t get executed.

    So if you Thread.stop a thread, and you have correctly applied failboxes, then all code that depends on the invariants that were broken by the Thread.stop, will be prevented from executing by the failbox mechanism. (But what you really want to do is cancel a failbox, not a thread. That way, if a thread running a mathematical computation is temporarily accessing a shared system resource, which is in its own failbox, the resource is left alone and the thread is stopped only when the thread resumes the mathematical computation.)

    I admit that assuming the whole program (including platform DLLs) uses failboxes correctly is a heavy assumption. However, I believe this is probably not much harder than using locks correctly. (Which perhaps doesn’t say much, but given that locks are already present, that problem has already been solved at least in the platform DLLs.) Specifically, you probably want to associate a failbox with each lock, so that if one lock block fails, the next thread that attempts to acquire the lock gets a FailboxException and doesn’t see the corrupted state.

  4. Aaron: Actually, Joe posted that a day after I posted this. :-) The 2006 date is because it was an internal email he wrote 2.5 years ago, it is not ‘an article from Fall 2006’. A number of us were thinking about this back then, but then working out the rules and implementing them and then writing them up and teaching them does take time. Doug did a great job of it for Java, and Joe for Windows/.NET.

    Bart: Without having read your paper, I’m troubled because you said, “all other threads… get an asynchronous exception.” Did you really mean asynchronous, where the exception is externally injected into the target thread wherever it happens to be? That’s a model known to be flawed, despite that .NET and other environments support them; nobody can reliably write code in the face of async exceptions. Perhaps you meant something else? For example, if the exception can only arise in the target thread at well-known points where the thread is blocked (formally, WaitSleepJoin points), then that’s fine, but that’s not an asynchronous exception.

    See my previous Effective Concurrency article “Interrupt Politely” for the reasons why Thread.Abort (and equivalently Thread.stop in Java and pthread_kill in POSIX), which interrupts a thread whatever it happens to be in the middle of doing, is fundamentally broken (unless you were intending to kill the whole process or whole machine anyway; see article for details). Asynchronous exceptions, which inject an exception into a thread whatever it happens to be in the middle of doing, are fundamentally broken for the same reasons.

  5. I have a paper on similar exception handling issues at ECOOP 2009. See http://www.cs.kuleuven.be/~bartj/failboxes/failboxes-ecoop09.pdf. It’s about a language extension, called failboxes, that’s intended to fix some issues with exception handling in the presence of non-exception-safe objects and try-catch, threads and locks, Thread.stop/Abort, and try-finally. However, it may provide some inspiration even if you do not adopt the language extension.

    The general idea is that each thread has a current failbox. When an exception occurs in a thread, its current failbox is marked as failed. All other threads running in that failbox then get an asynchronous exception. If you don’t care to split your application up into multiple units of failure, run all threads in the root failbox; in that case, any exception anywhere simply shuts down the process. If you do want fine-grained units of failure, you can build a hierarchy of failboxes.

  6. Yeah, really good question.

    I like how MT exception handling is done in Erlang. It’s really elegant. When an Actor fails and can do nothing about the failure, it just dies and reports the failure back to its “parent”. The parent may be waiting for the failure or not; the Actor doesn’t care.
    Such an approach really allows many errors to be processed in just the right place and in the right way. The same error in parallel threads (Actors) doing the same work is just the same failure message but with different sources. I don’t think “joining” these failure messages/exceptions is useful in all cases.
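
The "die and report to the parent" style described in the last comment can be approximated outside Erlang. The following is only a rough sketch using Python threads and a queue as the parent's mailbox; spawn and supervise are hypothetical names, and real Erlang processes and links offer far stronger isolation than this. Each worker makes no attempt to recover: it posts a failure message tagged with its own name and exits, and the parent receives the failures as separate messages rather than as one joined exception.

```python
import queue
import threading


def spawn(name, work, mailbox):
    """Start a worker that reports its outcome to the parent's mailbox."""
    def run():
        try:
            mailbox.put(("done", name, work()))
        except Exception as err:
            # No local recovery: report the failure to the parent and die.
            mailbox.put(("failed", name, err))

    thread = threading.Thread(target=run, name=name)
    thread.start()
    return thread


def supervise(jobs):
    """Run each job in its own worker and collect one message per worker."""
    mailbox = queue.Queue()
    threads = [spawn(name, work, mailbox) for name, work in jobs.items()]
    messages = [mailbox.get() for _ in threads]  # blocks until every worker has reported
    for thread in threads:
        thread.join()
    # Failures stay separate, keyed by their source, instead of being merged.
    results = {name: value for kind, name, value in messages if kind == "done"}
    failures = {name: err for kind, name, err in messages if kind == "failed"}
    return results, failures
```

Here the parent decides what to do with each failure message (retry, ignore, escalate), which is the design choice the comment argues for.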

Comments are closed.