
Implement new placement strategy: 'limit-active-tasks'#4118

Merged
nader-ziada merged 4 commits into concourse:master from Pix4D:active_tasks_lock_task_step
Jul 19, 2019
Conversation

@aledegano
Contributor

This PR proposes the addition of a new placement strategy. It introduces the concept of an active task: a task that is actually running on a worker (as opposed, for instance, to build containers, which can stay on the worker long after the task is finished).

A new "active_task" counter is added to the DB Worker, and the responsibility for increasing/decreasing the counter lies with the task_step.
The placement strategy considers the number of active tasks on each compatible worker and assigns any step to the worker with the fewest of them. Additionally, a parameter "MaxActiveTasksPerWorker" can be defined; in that case, workers that already have that number of active tasks will not be selected for task placement. If no worker has fewer than MaxActiveTasksPerWorker active tasks, the task will simply wait until one is free.
The worker selection in the task_step is serialized through a lock to prevent races where different tasks could land on the same worker.
Note that other "steps" are not constrained by MaxActiveTasksPerWorker and will simply choose the worker with the fewest active tasks (thus put, get, check, etc. will never be blocked).
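The selection logic described above can be sketched in Go. This is a simplified illustration, not the actual Concourse code: the `Worker` type and `pickWorker` helper here are hypothetical stand-ins for the real pool and strategy types.

```go
package main

import "fmt"

// Worker is a simplified stand-in for Concourse's db.Worker; only the
// fields relevant to the placement decision are modeled here.
type Worker struct {
	Name        string
	ActiveTasks int
}

// pickWorker returns the compatible worker with the fewest active tasks.
// If maxActiveTasks > 0, workers already at that limit are skipped; a nil
// result means the caller should wait and retry until a worker frees up.
// With maxActiveTasks == 0 there is no cap, so some worker is always chosen.
func pickWorker(workers []Worker, maxActiveTasks int) *Worker {
	var chosen *Worker
	for i := range workers {
		w := &workers[i]
		if maxActiveTasks > 0 && w.ActiveTasks >= maxActiveTasks {
			continue // worker is at capacity
		}
		if chosen == nil || w.ActiveTasks < chosen.ActiveTasks {
			chosen = w
		}
	}
	return chosen
}

func main() {
	pool := []Worker{{"w1", 3}, {"w2", 1}, {"w3", 2}}
	fmt.Println(pickWorker(pool, 0).Name) // no limit: fewest active tasks wins
	fmt.Println(pickWorker(pool, 1))      // every worker at the limit: nil, so a task would wait
}
```

This also shows why only task steps can block: put/get/check steps call the equivalent of `pickWorker` with no limit applied, so they always get the least-loaded worker.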

The PR is split into 3 commits to ease review:

  • 0a73380 is the actual code only implementing the strategy and the MaxActiveTasksPerWorker parameter
  • 887aa80 regenerates the fakes and adapts the existing tests only
  • 97138e0 implements new unit tests to cover the new code

This PR supersedes #4076 and, hopefully, implements all the suggestions proposed in that discussion.


@aledegano aledegano force-pushed the active_tasks_lock_task_step branch 2 times, most recently from b10e607 to 4ef6c31 Compare July 12, 2019 10:35
@aledegano aledegano changed the title Implement new placement strategy: 'fewest-avtive-tasks' Implement new placement strategy: 'fewest-active-tasks' Jul 12, 2019
This was referenced Jul 16, 2019
@kcmannem
Contributor

I like the change. Thanks for the help 👍

@kcmannem
Contributor

idk how everyone feels about this but it might be confusing to differentiate fewest-build-containers and fewest-active-tasks at a glance. Should we go with a name like limit-active-tasks


@aledegano
Contributor Author

> idk how everyone feels about this but it might be confusing to differentiate fewest-build-containers and fewest-active-tasks at a glance. Should we go with a name like limit-active-tasks

I don't have any strong opinion on the naming; just keep in mind that MaxActiveTasksPerWorker can be set to 0, in which case there's no limit on the active tasks and the worker with the fewest of them is picked.
In this sense it is somewhat similar to fewest-build-containers.
I'll be happy to rename it to anything we're happy about.

In any case, I've followed your suggestion and the strategy is now called limit-active-tasks

@aledegano aledegano force-pushed the active_tasks_lock_task_step branch from 1966878 to e396d1b Compare July 17, 2019 09:04
@aledegano aledegano changed the title Implement new placement strategy: 'fewest-active-tasks' Implement new placement strategy: 'limit-active-tasks' Jul 17, 2019
Contributor

@xtreme-sameer-vohra xtreme-sameer-vohra left a comment


Hey @aledeganopix4d
Looks good. We added a few comments.
There are some scenarios where the decrementCounter might not be executed (i.e. the ATC is restarted). In these scenarios, the worker would have to be retired and registered again to reset the counter. It would be worth documenting this so users are aware of these edge cases.

	defaultLimits   atc.ContainerLimits
	strategy        worker.ContainerPlacementStrategy
	resourceFactory resource.ResourceFactory
	lockFactory     lock.LockFactory
Contributor


It seems like this lockFactory isn't used, since StepFactory.TaskStep also has a lockFactory in its signature.
It would be preferable not to modify the interface and to use the lockFactory in the stepFactory struct instead.

Contributor Author


Well spotted, thanks!
I've now removed it!
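The reviewer's suggestion (inject the lockFactory once via the factory struct rather than adding it to every TaskStep call signature) might be sketched like this; all names here are simplified, hypothetical stand-ins for the real atc types:

```go
package main

import "fmt"

// LockFactory is a hypothetical, simplified stand-in for lock.LockFactory.
type LockFactory struct{ name string }

// stepFactory holds the lockFactory as a field, injected once at
// construction, instead of threading it through every method signature.
type stepFactory struct {
	lockFactory LockFactory
}

func NewStepFactory(lf LockFactory) *stepFactory {
	return &stepFactory{lockFactory: lf}
}

// TaskStep no longer needs a lockFactory parameter; it reads the field.
func (f *stepFactory) TaskStep(plan string) string {
	return fmt.Sprintf("task %q using lock factory %q", plan, f.lockFactory.name)
}

func main() {
	f := NewStepFactory(LockFactory{name: "db-locks"})
	fmt.Println(f.TaskStep("unit"))
}
```

The design upside is that the StepFactory interface stays unchanged while the dependency is still available wherever the struct's methods need it.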

	}

	if step.strategy.ModifiesActiveTasks() {
		if chosenWorker == nil {
Contributor


Should we differentiate the 2 potential conditions:

  • there are no workers in the pool
  • the workers in the pool are busy

In the first case, we might want to bubble the error up to the user.
In the latter case, Concourse can wait until a worker is free to take on the work.

And lastly, is there a max time we would want Concourse to keep trying before giving up and bubbling the error up?

Contributor Author


About the first comment:
That's already taken care of, because pool.allSatisfying already returns an error when the pool is empty, so that behavior is unchanged.

About the timeout: I've thought about it. If we implement it, it should be customizable, and maybe that's the right approach eventually, but for now I really wanted to keep the PR as simple as possible. We can refine the feature incrementally afterward.

Contributor


Thx @aledeganopix4d

Alessandro Degano added 4 commits July 18, 2019 08:58
Add active_tasks counter to db Worker.
Add db migrations to add the new column.
Parametrize max-build-tasks-per-worker from command line argument.
Create TaskStep lock.
Pass lockfactory to task step.
Add locking around task worker choosing.

Signed-off-by: Alessandro Degano <[email protected]>
No new tests yet introduced.

Signed-off-by: Alessandro Degano <[email protected]>
- placement: FewestActiveTasksPlacementStrategy
- task_step: Increase and Decrease active tasks in fakeWorker
- db.Worker: Increase and Decrease active tasks in DB

Signed-off-by: Alessandro Degano <[email protected]>
Rename ModifyActiveTasks() to ModifiesActiveTasks().

Signed-off-by: Alessandro Degano <[email protected]>
@aledegano aledegano force-pushed the active_tasks_lock_task_step branch from e396d1b to ab15b63 Compare July 18, 2019 06:58
@aledegano
Contributor Author

> Hey @aledeganopix4d
> Looks good. We added a few comments.
> There are some scenarios where the decrementCounter might not be executed (i.e. the ATC is restarted). In these scenarios, the worker would have to be retired and registered to reset the counter. It would be worth documenting this so users are aware of these edge cases.

Sure, good idea. Where would the best place be to document this?

@xtreme-sameer-vohra
Contributor

> Hey @aledeganopix4d
> Looks good. We added a few comments.
> There are some scenarios where the decrementCounter might not be executed (i.e. the ATC is restarted). In these scenarios, the worker would have to be retired and registered to reset the counter. It would be worth documenting this so users are aware of these edge cases.
>
> Sure, good idea. Where would the best place be to document this?

Hey @aledeganopix4d
We can document the ContainerPlacementStrategy->limit-active-tasks & MaxActiveTasksPerWorker flags as being experimental in atc/atccmd/command.go

And if you're so kind, you can submit a PR for https://github.com/concourse/docs/blob/master/lit/docs/operation/container-placement.lit. Here you can leverage the \warn tag to make a note that this is experimental

Thanks :)

@aledegano
Contributor Author

> Hey @aledeganopix4d
> Looks good. We added a few comments.
> There are some scenarios where the decrementCounter might not be executed (i.e. the ATC is restarted). In these scenarios, the worker would have to be retired and registered to reset the counter. It would be worth documenting this so users are aware of these edge cases.
>
> Sure, good idea. Where would the best place be to document this?
>
> Hey @aledeganopix4d
> We can document the ContainerPlacementStrategy->limit-active-tasks & MaxActiveTasksPerWorker flags as being experimental in atc/atccmd/command.go
>
> And if you're so kind, you can submit a PR for https://github.com/concourse/docs/blob/master/lit/docs/operation/container-placement.lit. Here you can leverage the \warn tag to make a note that this is experimental
>
> Thanks :)

Hello @xtreme-sameer-vohra,
I've added the Experimental notice in the command line description and then opened a PR in the docs describing this new placement strategy here: concourse/docs#231.
Thanks.

Contributor

@nader-ziada nader-ziada left a comment


Thanks @aledeganopix4d for the PR and the followup changes!

@nader-ziada nader-ziada merged commit aab8f4a into concourse:master Jul 19, 2019
@aledegano aledegano deleted the active_tasks_lock_task_step branch July 19, 2019 14:39
aledegano pushed a commit to Pix4D/concourse that referenced this pull request Jul 22, 2019
This is a follow-up of concourse#4118 which introduced the new placement strategy: `limit-active-tasks`.

To improve the user experience, this PR writes to the UI when the task is waiting for a worker
to be free, warning the user that the system is at full capacity and that the task will wait
for a worker to free up.

Once a worker is available and the task starts, the interface informs the user that the task
started and how long it had been waiting.

Signed-off-by: Alessandro Degano <[email protected]>
aledegano pushed a commit to Pix4D/concourse that referenced this pull request Jul 23, 2019
aledegano pushed a commit to Pix4D/concourse that referenced this pull request Jul 23, 2019
ddadlani pushed a commit that referenced this pull request Jul 23, 2019
Fix conflicts caused by PR #4118 - addition of the `limit-active-tasks`
container placement strategy
@jamieklassen jamieklassen added the release/documented Documentation and release notes have been updated. label Aug 26, 2019
jamieklassen pushed a commit that referenced this pull request Aug 26, 2019
#4118
#4148
#4208
#4277
#4142
#4221
#4293

Signed-off-by: James Thomson <[email protected]>
Co-authored-by: Jamie Klassen <[email protected]>
@gerhard

gerhard commented Sep 3, 2019

Really looking forward to v5.5 shipping with this feature. We've just hit a new wall of failed tests due to CPU contention, even though we have massively over-provisioned workers. It's either baby-sitting our pipelines so that builds go through, or waiting days for builds to go through. 🚢 :shipit:


matthewpereira pushed a commit that referenced this pull request Sep 5, 2019
k8s-ci-robot pushed a commit to helm/charts that referenced this pull request Sep 6, 2019
This commit adds the new parameters that were added to Concourse 5.5.

Here's a breakdown of the new parameters:

- max-active-tasks-per-worker

  > used by the `limit-active-tasks` container placement strategy
  > concourse/concourse#4118

- support for influxdb batching and bigger buffer size for metrics emissions

  > concourse/concourse#3937

- limiting number of max connections in db conn pools

  > concourse/concourse#4232

Signed-off-by: Ciro S. Costa <[email protected]>
Co-authored-by: Zoe Tian <[email protected]>
kengou pushed a commit to kengou/charts that referenced this pull request Sep 18, 2019
ramkumarvs pushed a commit to yugabyte/charts-helm-fork that referenced this pull request Sep 30, 2019
taylorsilva pushed a commit to taylorsilva/concourse-helm that referenced this pull request Oct 2, 2019