Skip to content

Comments

crypto: thread: serialize concurrent joins#19433

Closed
ckalina wants to merge 2 commits intoopenssl:masterfrom
ckalina:mr_thread_mjoin_v1
Closed

crypto: thread: serialize concurrent joins#19433
ckalina wants to merge 2 commits intoopenssl:masterfrom
ckalina:mr_thread_mjoin_v1

Conversation

@ckalina
Copy link
Contributor

@ckalina ckalina commented Oct 18, 2022

Multiple concurrent joins with a running thread suffer from a race condition that allows concurrent join calls to perform concurrent arch specific join calls, which is UB on POSIX, or to concurrently execute join and terminate calls.

As soon as a thread T1 exists, one of the threads that joins with T1 is selected to perform the join, the remaining ones await completion. Once completed, the remaining calls immediately return. If the join failed, another thread is selected to attempt the join operation.

Forcefully terminating a thread that is in the process of joining another thread is not supported.

Common code from thread_posix and thread_win was refactored to use common wrapper that handles synchronization.

This should fix the timeout encountered while executing threads test (likely due to test_thread_native_multiple_joins).

Checklist
  • tests are added or updated

Multiple concurrent joins with a running thread suffer from a race
condition that allows concurrent join calls to perform concurrent arch
specific join calls, which is UB on POSIX, or to concurrently execute
join and terminate calls.

As soon as a thread T1 exists, one of the threads that joins with T1
is selected to perform the join, the remaining ones await completion.
Once completed, the remaining calls immediately return. If the join
failed, another thread is selected to attempt the join operation.

Forcefully terminating a thread that is in the process of joining
another thread is not supported.

Common code from thread_posix and thread_win was refactored to use
common wrapper that handles synchronization.

Signed-off-by: Čestmír Kalina <[email protected]>
@github-actions github-actions bot added the severity: fips change The pull request changes FIPS provider sources label Oct 18, 2022
@t8m
Copy link
Member

t8m commented Oct 18, 2022

Unfortunately this did not seem to fix the MacOSX hang.

@t8m t8m added branch: master Applies to master branch triaged: bug The issue/pr is/fixes a bug severity: urgent Fixes an urgent issue (exempt from 24h grace period) labels Oct 19, 2022
While POSIX threads are cancellable and may be asynchronously cancelled,
their cancellation is not guaranteed by the POSIX standard.

test_thread_noreturn, which simulates a long-running possibly
unresponsive thread:

	THREAD openssl#1		THREAD openssl#2
	LOCK L1
	SPAWN openssl#2
				LOCK L1

On MacOS, cancelling such thread only queues cancellation request, but
the following pthread_join hangs.

Replace this implementation by an unbounded sequence of sleeps instead.

Signed-off-by: Čestmír Kalina <[email protected]>
@ckalina
Copy link
Contributor Author

ckalina commented Oct 21, 2022

The MacOS hang was indeed independent of this and should now hopefully be fixed.

While POSIX threads are explicitly set cancellable and may be asynchronously cancelled, their cancellation is not guaranteed by the POSIX standard.

test_thread_noreturn, which simulates a long-running possibly unresponsive thread:

	THREAD T1	THREAD T2
-------------------------------------------------------------------------------
	LOCK L1
	SPAWN T2
				LOCK L1

On MacOS, cancelling such thread only queues cancellation request, but the following pthread_join hangs.

Replace this implementation by an unbounded sequence of sleeps instead. One second sleeps are chosen, but anything smaller goes too. I'd prefer not to use larger values, though, to ensure that it's woken up often enough.

@ckalina ckalina marked this pull request as ready for review October 21, 2022 09:43
@t8m t8m added approval: review pending This pull request needs review by a committer approval: otc review pending labels Oct 21, 2022
@t8m
Copy link
Member

t8m commented Oct 21, 2022

This is urgent. The buildbot macosx failure is unrelated.

@hlandau hlandau added approval: done This pull request has the required number of approvals and removed approval: review pending This pull request needs review by a committer labels Oct 21, 2022
@hlandau
Copy link
Member

hlandau commented Oct 21, 2022

Agree urgent

@t8m t8m added approval: ready to merge The 24 hour grace period has passed, ready to merge and removed approval: done This pull request has the required number of approvals labels Oct 21, 2022
@t8m
Copy link
Member

t8m commented Oct 21, 2022

I'll wait for the windows buildbot builds to finish and merge afterwards.

@t8m
Copy link
Member

t8m commented Oct 21, 2022

For some reason on Windows the test still hangs when running on buildbot. I'll merge this anyway to unblock at least the macosx builds and we'll continue in another PR.

@t8m
Copy link
Member

t8m commented Oct 21, 2022

Merged to master branch. Thank you. Assuming a PR fixing the Windows buildbot hang will follow.

@t8m t8m closed this Oct 21, 2022
openssl-machine pushed a commit that referenced this pull request Oct 21, 2022
Multiple concurrent joins with a running thread suffer from a race
condition that allows concurrent join calls to perform concurrent arch
specific join calls, which is UB on POSIX, or to concurrently execute
join and terminate calls.

As soon as a thread T1 exists, one of the threads that joins with T1
is selected to perform the join, the remaining ones await completion.
Once completed, the remaining calls immediately return. If the join
failed, another thread is selected to attempt the join operation.

Forcefully terminating a thread that is in the process of joining
another thread is not supported.

Common code from thread_posix and thread_win was refactored to use
common wrapper that handles synchronization.

Signed-off-by: Čestmír Kalina <[email protected]>

Reviewed-by: Hugo Landau <[email protected]>
Reviewed-by: Tomas Mraz <[email protected]>
(Merged from #19433)
openssl-machine pushed a commit that referenced this pull request Oct 21, 2022
While POSIX threads are cancellable and may be asynchronously cancelled,
their cancellation is not guaranteed by the POSIX standard.

test_thread_noreturn, which simulates a long-running possibly
unresponsive thread:

	THREAD #1		THREAD #2
	LOCK L1
	SPAWN #2
				LOCK L1

On MacOS, cancelling such thread only queues cancellation request, but
the following pthread_join hangs.

Replace this implementation by an unbounded sequence of sleeps instead.

Signed-off-by: Čestmír Kalina <[email protected]>

Reviewed-by: Hugo Landau <[email protected]>
Reviewed-by: Tomas Mraz <[email protected]>
(Merged from #19433)
@mouse07410
Copy link
Contributor

I confirm that this PR fixed the problem on MacOS.

beldmit pushed a commit to beldmit/openssl that referenced this pull request Dec 26, 2022
Multiple concurrent joins with a running thread suffer from a race
condition that allows concurrent join calls to perform concurrent arch
specific join calls, which is UB on POSIX, or to concurrently execute
join and terminate calls.

As soon as a thread T1 exists, one of the threads that joins with T1
is selected to perform the join, the remaining ones await completion.
Once completed, the remaining calls immediately return. If the join
failed, another thread is selected to attempt the join operation.

Forcefully terminating a thread that is in the process of joining
another thread is not supported.

Common code from thread_posix and thread_win was refactored to use
common wrapper that handles synchronization.

Signed-off-by: Čestmír Kalina <[email protected]>

Reviewed-by: Hugo Landau <[email protected]>
Reviewed-by: Tomas Mraz <[email protected]>
(Merged from openssl#19433)
beldmit pushed a commit to beldmit/openssl that referenced this pull request Dec 26, 2022
While POSIX threads are cancellable and may be asynchronously cancelled,
their cancellation is not guaranteed by the POSIX standard.

test_thread_noreturn, which simulates a long-running possibly
unresponsive thread:

	THREAD #1		THREAD #2
	LOCK L1
	SPAWN #2
				LOCK L1

On MacOS, cancelling such thread only queues cancellation request, but
the following pthread_join hangs.

Replace this implementation by an unbounded sequence of sleeps instead.

Signed-off-by: Čestmír Kalina <[email protected]>

Reviewed-by: Hugo Landau <[email protected]>
Reviewed-by: Tomas Mraz <[email protected]>
(Merged from openssl#19433)
@ckalina ckalina deleted the mr_thread_mjoin_v1 branch August 12, 2023 15:52
ckalina added a commit to ckalina/openssl that referenced this pull request Aug 12, 2023
Multiple concurrent joins with a running thread suffer from a race
condition that allows concurrent join calls to perform concurrent arch
specific join calls, which is UB on POSIX, or to concurrently execute
join and terminate calls.

As soon as a thread T1 exists, one of the threads that joins with T1
is selected to perform the join, the remaining ones await completion.
Once completed, the remaining calls immediately return. If the join
failed, another thread is selected to attempt the join operation.

Forcefully terminating a thread that is in the process of joining
another thread is not supported.

Common code from thread_posix and thread_win was refactored to use
common wrapper that handles synchronization.

Signed-off-by: Čestmír Kalina <[email protected]>

Reviewed-by: Hugo Landau <[email protected]>
Reviewed-by: Tomas Mraz <[email protected]>
(Merged from openssl#19433)
ckalina added a commit to ckalina/openssl that referenced this pull request Aug 12, 2023
While POSIX threads are cancellable and may be asynchronously cancelled,
their cancellation is not guaranteed by the POSIX standard.

test_thread_noreturn, which simulates a long-running possibly
unresponsive thread:

	THREAD openssl#1		THREAD openssl#2
	LOCK L1
	SPAWN openssl#2
				LOCK L1

On MacOS, cancelling such thread only queues cancellation request, but
the following pthread_join hangs.

Replace this implementation by an unbounded sequence of sleeps instead.

Signed-off-by: Čestmír Kalina <[email protected]>

Reviewed-by: Hugo Landau <[email protected]>
Reviewed-by: Tomas Mraz <[email protected]>
(Merged from openssl#19433)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approval: ready to merge The 24 hour grace period has passed, ready to merge branch: master Applies to master branch severity: fips change The pull request changes FIPS provider sources severity: urgent Fixes an urgent issue (exempt from 24h grace period) triaged: bug The issue/pr is/fixes a bug

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants