Skip to content

Comments

Do not acquire lock when decreasing task count.#4277

Merged
kcmannem merged 1 commit intoconcourse:masterfrom
Pix4D:no_locks_decrease_tasks
Aug 22, 2019
Merged

Do not acquire lock when decreasing task count.#4277
kcmannem merged 1 commit intoconcourse:masterfrom
Pix4D:no_locks_decrease_tasks

Conversation

@aledegano
Copy link
Contributor

The tasks count decreasing (used in the limit-active-tasks strategy) can be
executed concourrently as the DB transaction will make sure that the end
value is consistent.

Trying to acquire a lock when decreasing the tasks count in a system with
several flying task can easily result in a race to acquire lock and eventually
even a "de facto" dead lock.

We observed this issue on our Production CI where it eventually all just hanged, what happens is the following:

  • New tasks acquire the lock to check if a worker is free and all are busy
  • Completed tasks try to acquire a lock to decrease the task and free the worker
  • More tasks enter the queue and try to acquire a lock
  • The decrease-active-tasks is can perform its operation once in a blue moon making the situation worse and worse

@deniseyu
@ddadlani

@aledegano aledegano requested a review from a team August 14, 2019 09:08
The tasks count decreasing (used in the limit-active-tasks strategy) can be
executed concourrently as the DB transaction will make sure that the end
value is consistent.

Trying to acquire a lock when decreasing the tasks count in a system with
several flying task can easily result in a race to acquire lock and eventually
even a "de facto" dead lock.

Signed-off-by: Alessandro Degano <[email protected]>
@aledegano aledegano force-pushed the no_locks_decrease_tasks branch from 4cd63f6 to 2047500 Compare August 14, 2019 09:35
@ddadlani
Copy link
Contributor

@aledeganopix4d interesting! I apologize for not answering your question about this in another thread earlier - I filed it away to respond to later and it completely slipped my mind.

We did not consider potential lock contention. The reason we added the lock was that in the case that between reading the number of tasks and actually selecting a worker, the number of tasks on a given worker might decrease. This is not a huge deal since in the next loop iteration the worker would be selected, but we were aiming for correctness.

If it truly is causing this much contention then I agree with you, we can just rely on DB locking as the failure case is not that much of a problem.

@aledegano
Copy link
Contributor Author

@ddadlani sorry for nagging: do you think we can merge this one?
We're eager to try it in prod!

@kcmannem kcmannem merged commit a670997 into concourse:master Aug 22, 2019
@jamieklassen jamieklassen added the release/undocumented This didn't warrant being documented or put in release notes. label Aug 26, 2019
jamieklassen pushed a commit that referenced this pull request Aug 26, 2019
#4118
#4148
#4208
#4277
#4142
#4221
#4293

Signed-off-by: James Thomson <[email protected]>
Co-authored-by: Jamie Klassen <[email protected]>
matthewpereira pushed a commit that referenced this pull request Sep 5, 2019
#4118
#4148
#4208
#4277
#4142
#4221
#4293

Signed-off-by: James Thomson <[email protected]>
Co-authored-by: Jamie Klassen <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

release/undocumented This didn't warrant being documented or put in release notes.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants