Conversation

gasche (Member) commented Dec 7, 2022

See #11800 for a CI failure report and a proposed failure scenario:

  1. the main domain spawns a domain A
  2. A starts its backup thread
  3. A terminates, and in the process:
    • it sets domain_self->backup_thread_running = 0
    • it has not signalled the backup thread to terminate yet
    • it releases all_domains_lock, and is about to enter the
      delicate part that collects the rest of the state without owning
      domain_self anymore
  4. the main domain spawns a domain B, which reuses the slot of A
  5. B starts its backup thread

At this point, if my reasoning is right, B sees
domain_self->backup_thread_running == 0 while the backup thread of
A has not terminated yet, so the assertion in install_backup_thread may fail.

The present commit should fix this failure scenario by ensuring that the backup thread is signalled before the terminating domain makes its domain-internal data available for reuse.

We don't busy-wait on the backup thread at this point (unlike in the previous iteration a8e8ff3), so the backup thread may still be active when a new domain tries to install it again; but in this (rare) case it will be in the BT_TERMINATE state expected by install_backup_thread, which will itself busy-wait.
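To make the intended ordering concrete, here is a small standalone pthread analogue of the pattern described above. It is an illustration only: the slot layout, the function names and the message values are simplified stand-ins for the runtime's backup_thread_msg / backup_thread_running machinery, not the actual code. The terminating side signals termination before its slot may be reused; the reinstalling side busy-waits only if the previous backup thread is still winding down.

  /* Standalone pthread analogue (hypothetical names, simplified layout). */
  #include <pthread.h>
  #include <sched.h>
  #include <stdatomic.h>
  #include <stdio.h>

  enum bt_msg { BT_NONE, BT_TERMINATE };

  struct slot {
    pthread_mutex_t lock;        /* plays the role of domain_lock */
    pthread_cond_t cond;         /* plays the role of domain_cond */
    _Atomic int msg;             /* stand-in for backup_thread_msg */
    _Atomic int backup_running;  /* stand-in for backup_thread_running */
  };

  static void *backup_thread(void *arg) {
    struct slot *s = arg;
    pthread_mutex_lock(&s->lock);
    while (atomic_load(&s->msg) != BT_TERMINATE)
      pthread_cond_wait(&s->cond, &s->lock);
    pthread_mutex_unlock(&s->lock);
    atomic_store(&s->backup_running, 0);  /* last action of the exiting thread */
    return NULL;
  }

  /* Termination path: signal BT_TERMINATE *before* the slot becomes reusable.
     No busy-wait here: a later install may still observe backup_running == 1,
     but then msg is already BT_TERMINATE, so it knows to wait. */
  static void terminate_slot(struct slot *s) {
    pthread_mutex_lock(&s->lock);
    atomic_store(&s->msg, BT_TERMINATE);
    pthread_cond_signal(&s->cond);
    pthread_mutex_unlock(&s->lock);
    /* from here on, the slot is considered reusable */
  }

  /* Reinstallation on a reused slot: spin until the old thread has exited. */
  static void reinstall_backup_thread(struct slot *s) {
    while (atomic_load(&s->backup_running))
      sched_yield();
    atomic_store(&s->msg, BT_NONE);
    atomic_store(&s->backup_running, 1);
    /* a fresh backup thread would be spawned here */
  }

  int main(void) {
    struct slot s = { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER,
                      BT_NONE, 1 };
    pthread_t bt;
    pthread_create(&bt, NULL, backup_thread, &s);
    terminate_slot(&s);
    reinstall_backup_thread(&s);
    pthread_join(bt, NULL);
    printf("slot reused only after the old backup thread signalled its exit\n");
    return 0;
  }

The real install_backup_thread also has to handle the case where the backup thread was never started (a program with no additional domains), which this sketch ignores.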

Inline review thread on runtime/domain.c, at:

  caml_plat_unlock(&domain_self->domain_lock);

  CAMLassert (domain_self->backup_thread_running);
  /* CR gasche: why is this assertion correct in the case where

Contributor:

Isn't it because the main OCaml thread does not go through domain_terminate?

gasche (Member, Author):

That would be my guess as well, thanks! But I understand that this could change in the future. Should we remove the assertion? (Or maybe it could or should become an if (domain_self->backup_thread_running) { ... }, which would also disable the BT_TERMINATE signalling.) In any case, it is odd that the code comment now sitting shortly above contradicts this assertion.

(I wonder what happens if we start calling domain_terminate for the main domain as well. Do other things break, or is it supposed to work?)

Contributor:

I'm not sure this assertion is useful anymore, I'm in favor of removing it at this point.

If a thread goes through domain_terminate, then a backup thread must be running, and it feels weird to check for this invariant at termination.

For Domain 0 calling domain_terminate, I am not sure this is meant to happen. Domain 0 returning means all remaining threads will be shut down on exit.

However, I see issues coming when we re-implement proper cleanup at exit, where this dual path of cleaning and exiting may come back to bite us.

(And, though not quite relevant here, being left with a single domain 0 does not mean it has no backup thread: it may very well be that all other domains have shut down, with domain 0 retaining its backup thread afterwards.)

gasche (Member, Author) commented Dec 7, 2022

cc @Engil and @kayceesrk (who worked on this aspect of the code)

One aspect where this PR is lacking is on the testing front: I haven't tried to reproduce the CI failure on my machine (and I doubt it would work as-is), so I can't tell whether this really was the issue and whether this PR solves it. (In theory it was and it does, but that is only the theory.)

Given that apparently this bug was already noticed during multicore-ocaml development ( #11800 (comment) ), it would be nice to think of a reliable way to have a regression test. Any suggestions?

(One experiment I can make is to introduce a large pause at the end of domain termination, after domain_self is made available for reuse, and see whether the issue triggers more easily. If it works, maybe we could introduce such a pause in debug mode, just to make this category of domain-reuse issues trigger more often?)

abbysmal (Contributor) commented Dec 7, 2022

> (One experiment I can make is to introduce a large pause at the end of domain termination, after domain_self is made available for reuse, and see whether the issue triggers more easily. If it works, maybe we could introduce such a pause in debug mode, just to make this category of domain-reuse issues trigger more often?)

I think this is quite a fun idea: we could introduce such delays in the debug runtime, and it could perhaps make it easier to spot lifecycle regressions with tight timing triggers?
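For concreteness, here is one possible shape of such a debug-only pause (a sketch only: both the helper and the environment variable are hypothetical, not an existing runtime facility):

  /* Hypothetical debug-only delay; the environment variable name is invented. */
  #include <stdio.h>
  #include <stdlib.h>
  #include <time.h>

  static void debug_termination_delay(void) {
    const char *v = getenv("CAML_DEBUG_TERMINATION_DELAY_MS");  /* hypothetical knob */
    if (v == NULL) return;                 /* normal runs are unaffected */
    long ms = strtol(v, NULL, 10);
    if (ms <= 0) return;
    struct timespec t = { ms / 1000, (ms % 1000) * 1000000L };
    nanosleep(&t, NULL);                   /* widen the race window */
  }

  int main(void) {
    /* in a debug runtime, this call would sit at the end of domain
       termination, after the domain slot is made available for reuse */
    debug_termination_delay();
    printf("termination path continued after the optional delay\n");
    return 0;
  }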

gasche (Member, Author) commented Dec 7, 2022

I tried to add a pause in various places in domain_terminate:

  caml_plat_spin_wait(
      10*1000*1000 /* 10ms */,
      __FILE__, __LINE__, __func__);

but I didn't have any luck triggering the bug after that -- I ran the tests/parallel tests with the pause and the debug runtime enabled, and that was not enough (or maybe I did things wrong). That's all from me tonight :-)

abbysmal (Contributor) commented Dec 7, 2022

Looking at the CI, I feel like the fix is not quite it: there is something more subtle to take into account, and I was wrong in my assumption that it was probably OK to move the backup-thread termination here.

On the parallel/join.ml test, I can see the following behavior:
Thread 2 attempts to terminate and triggers a minor cycle.

  Thread 2 received signal SIGSEGV, Segmentation fault.
  caml_empty_minor_heap_promote (domain=domain@entry=0x3e5024002b70, participating_count=participating_count@entry=3, participating=participating@entry=0x4da9a0 <stw_request+64>) at runtime/minor_gc.c:511
  511           intnat major_ref_size = foreign_major_ref->ptr - foreign_major_ref->base;

The foreign_domain->minor_tables being accessed belongs to a recently terminated domain, as we can see by reverse-stepping to the point where it is set to 0.

  Thread 8 hit Hardware watchpoint 2: *0x3e5024002bc8

  Old value = 0
  New value = 738207968
  domain_terminate () at runtime/domain.c:1745
  1745      domain_state->minor_tables = 0;
  (rr) bt
  #0  domain_terminate () at runtime/domain.c:1745
  #1  domain_thread_func (v=<optimized out>) at runtime/domain.c:1094
  #2  0x0000000000296822 in start_thread (arg=<optimized out>) at pthread_create.c:443
  #3  0x0000000000236314 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:100

I feel like we may now be releasing the backup thread too early, and either that is bad in itself or there is another bug lurking around, but it is bedtime for me.

Inline review thread on runtime/domain.c, at:

  will not have been started */
  atomic_store_rel(&domain_self->backup_thread_msg, BT_TERMINATE);
  caml_plat_signal(&domain_self->domain_cond);
  caml_plat_unlock(&domain_self->domain_lock);

Contributor:

This is releasing the domain_self->domain_lock too early compared to trunk. Should we leave the unlock where it was and only move the other two lines here?

Contributor:

Also, I would like to recreate this failure under rr and make sure that the failure is indeed due to the reason we think it is.

gasche (Member, Author) commented Dec 8, 2022

@kayceesrk I agree that releasing the domain lock is quite probably the cause of the testsuite failure here. (But then I tried to understand more precisely the protocol around the domain lock and I got a bit lost.)

gasche (Member, Author) commented Dec 8, 2022

Let me try to explain what is bugging me about domain_lock; maybe the experts in the room can help me understand better. If I read the code correctly, the part of domain_terminate that writes the BT_TERMINATE status does so while holding the domain lock, and releases the domain lock right away:

runtime/domain.c, lines 1753-1758 (at 97aa649):

  /* signal the domain termination to the backup thread
     NB: for a program with no additional domains, the backup thread
     will not have been started */
  atomic_store_rel(&domain_self->backup_thread_msg, BT_TERMINATE);
  caml_plat_signal(&domain_self->domain_cond);
  caml_plat_unlock(&domain_self->domain_lock);

On the other hand, create_domain takes the domain lock:

runtime/domain.c, lines 583-588 (at 97aa649):

  s->unique_id = fresh_domain_unique_id();
  s->interrupt_word = &domain_state->young_limit;
  s->running = 1;
  atomic_fetch_add(&caml_num_domains_running, 1);
  caml_plat_lock(&d->domain_lock);

and the call to install_backup_thread only occurs after this domain_create call has taken the lock:

runtime/domain.c, lines 1081-1082 (at 97aa649):

  if (domain_self) {
    install_backup_thread(domain_self);

Naively this suggests that the race scenario I described above, when creating this PR, cannot actually happen: a new domain reusing the same dom_internal slot could only install its backup thread after the previous domain termination has released the domain lock, and this occurs after BT_TERMINATE has been written, so at this point the install_backup_thread assertion failure is not possible anymore.

But: there are other uses of the domain lock, and it may be that something funny happens. (Could something in domain_terminate release and re-acquire the domain lock after leaving the STW set but before writing the BT_TERMINATE status?)

I must say that it's not so easy to follow the use of the various locking mechanisms in domain.c and the invariants they should protect. For example the backup thread sometimes waits on an interruptor-related condition variable s->cond, sometimes on domain_cond, and it is not obvious why or what assumptions are made. Independently, the part of domain_terminate after removal from the STW set seems careful not to access domain_self, only domain_state... except precisely for the three lines that deal with the backup thread.


Side question: online documentation on condition variables suggests using a while (cond-is-false) pthread_cond_wait(cond) loop, to guard against spurious wakeups. I wonder why the backup thread code uses a simple if (cond-is-false) pthread_cond_wait(cond) instead; what happens in the case of a spurious wakeup?
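For reference, here is a minimal, self-contained illustration of the pattern the documentation recommends (the names are illustrative, not taken from the runtime): the predicate is re-checked in a while loop, so a spurious wakeup simply goes back to waiting.

  /* Waiter re-checks its predicate in a while loop (names are illustrative). */
  #include <pthread.h>
  #include <stdio.h>

  static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
  static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
  static int ready = 0;          /* the condition the waiter actually cares about */

  static void *waiter(void *arg) {
    (void)arg;
    pthread_mutex_lock(&lock);
    /* an `if` would fall through on a spurious wakeup even though ready == 0;
       the `while` re-checks the predicate and goes back to waiting */
    while (!ready)
      pthread_cond_wait(&cond, &lock);
    pthread_mutex_unlock(&lock);
    printf("condition observed\n");
    return NULL;
  }

  int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, waiter, NULL);
    pthread_mutex_lock(&lock);
    ready = 1;                   /* update the predicate under the lock */
    pthread_cond_signal(&cond);  /* then signal */
    pthread_mutex_unlock(&lock);
    pthread_join(t, NULL);
    return 0;
  }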

(I'm at the point where I realize that I don't understand much of what is going on here, and I am thinking of closing the PR to go back to the issue discussing the assert failure.)

kayceesrk (Contributor) commented:

> (I'm at the point where I realize that I don't understand much of what is going on here, and I am thinking of closing the PR to go back to the issue discussing the assert failure.)

I am happy to take a look at clarifying/reorganizing the runtime code around domain termination for readability. This can be done elsewhere. For this issue, I do think that we should understand and reproduce the assertion failure before we can propose a solution. So closing this PR is the right thing to do.
