
Implement new placement strategy: 'limit-active-tasks'#4118

Merged
nader-ziada merged 4 commits into concourse:master from Pix4D:active_tasks_lock_task_step
Jul 19, 2019
Conversation

@aledegano
Contributor

This PR proposes the addition of a new placement strategy. It introduces the concept of an active task: a task that is actually running on a worker (as opposed, for instance, to build containers, which can stay on the worker long after the task is finished).

A new "active_task" counter is added to the DB Worker, and the responsibility for increasing/decreasing the counter lies with the task_step.
The placement strategy considers the number of active tasks on each compatible worker and assigns any step to the worker with the fewest of them. Additionally, a parameter "MaxActiveTasksPerWorker" can be defined; in that case, workers that already have that number of active tasks will not be selected for task placement. If no worker has fewer than MaxActiveTasksPerWorker active tasks, the task will simply wait until one is free.
The worker selection in the task_step is serialized through a lock to prevent races where different tasks could land on the same worker.
Note that other "steps" are not constrained by MaxActiveTasksPerWorker and will simply choose the worker with the fewest active tasks (thus put, get, check, etc. will never be blocked).
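The selection logic described above can be sketched in Go. This is a simplified illustration, not the actual Concourse code: the `Worker` type and `pickWorker` helper here are hypothetical stand-ins for the real pool and strategy types.

```go
package main

import "fmt"

// Worker is a simplified stand-in for Concourse's db.Worker; only the
// fields relevant to the placement decision are modeled here.
type Worker struct {
	Name        string
	ActiveTasks int
}

// pickWorker returns the compatible worker with the fewest active tasks.
// If maxActiveTasks > 0, workers already at that limit are skipped; a nil
// result means the caller should wait and retry until a worker frees up.
// With maxActiveTasks == 0 there is no cap, so some worker is always chosen.
func pickWorker(workers []Worker, maxActiveTasks int) *Worker {
	var chosen *Worker
	for i := range workers {
		w := &workers[i]
		if maxActiveTasks > 0 && w.ActiveTasks >= maxActiveTasks {
			continue // worker is at capacity
		}
		if chosen == nil || w.ActiveTasks < chosen.ActiveTasks {
			chosen = w
		}
	}
	return chosen
}

func main() {
	pool := []Worker{{"w1", 3}, {"w2", 1}, {"w3", 2}}
	fmt.Println(pickWorker(pool, 0).Name) // no limit: fewest active tasks wins
	fmt.Println(pickWorker(pool, 1))      // every worker at the limit: nil, so a task would wait
}
```

This also shows why only task steps can block: put/get/check steps call the equivalent of `pickWorker` with no limit applied, so they always get the least-loaded worker.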

The PR is split into 3 commits to ease review:

  • 0a73380 is the actual code only implementing the strategy and the MaxActiveTasksPerWorker parameter
  • 887aa80 regenerates the fakes and adapts the existing tests only
  • 97138e0 implements new unit tests to cover the new code

This PR supersedes #4076 and, hopefully, implements all the suggestions proposed in that discussion.


@aledegano aledegano force-pushed the active_tasks_lock_task_step branch 2 times, most recently from b10e607 to 4ef6c31 Compare July 12, 2019 10:35
@aledegano aledegano changed the title Implement new placement strategy: 'fewest-avtive-tasks' Implement new placement strategy: 'fewest-active-tasks' Jul 12, 2019
This was referenced Jul 16, 2019
@kcmannem
Contributor

I like the change. Thanks for the help 👍

@kcmannem
Contributor

idk how everyone feels about this but it might be confusing to differentiate fewest-build-containers and fewest-active-tasks at a glance. Should we go with a name like limit-active-tasks


@aledegano
Contributor Author

> idk how everyone feels about this but it might be confusing to differentiate fewest-build-containers and fewest-active-tasks at a glance. Should we go with a name like limit-active-tasks

I don't have any strong opinion on the naming; just keep in mind that MaxActiveTasksPerWorker can be set to 0, in which case there's no limit on the active tasks and the worker with the fewest of them is picked.
In this sense it is somewhat similar to fewest-build-containers.
I'll be happy to rename it to anything we're happy about.

In any case, I've followed your suggestion and the strategy is now called limit-active-tasks

@aledegano aledegano force-pushed the active_tasks_lock_task_step branch from 1966878 to e396d1b Compare July 17, 2019 09:04
@aledegano aledegano changed the title Implement new placement strategy: 'fewest-active-tasks' Implement new placement strategy: 'limit-active-tasks' Jul 17, 2019
Contributor

@xtreme-sameer-vohra xtreme-sameer-vohra left a comment


Hey @aledeganopix4d
Looks good. We added a few comments.
There are some scenarios where the decrementCounter might not be executed (i.e. the ATC is restarted). In these scenarios, the worker would have to be retired and registered again to reset the counter. It would be worth documenting this so users are aware of these edge cases.

	defaultLimits   atc.ContainerLimits
	strategy        worker.ContainerPlacementStrategy
	resourceFactory resource.ResourceFactory
	lockFactory     lock.LockFactory
Contributor


It seems like this lockFactory isn't used, since StepFactory.TaskStep also has a lockFactory in its signature.
It would be preferable not to modify the interface and to use the lockFactory in the stepFactory struct instead.

Contributor Author


Well spotted, thanks!
I've now removed it!
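The reviewer's suggestion (inject the lockFactory once via the factory struct rather than adding it to every TaskStep call signature) might be sketched like this; all names here are simplified, hypothetical stand-ins for the real atc types:

```go
package main

import "fmt"

// LockFactory is a hypothetical, simplified stand-in for lock.LockFactory.
type LockFactory struct{ name string }

// stepFactory holds the lockFactory as a field, injected once at
// construction, instead of threading it through every method signature.
type stepFactory struct {
	lockFactory LockFactory
}

func NewStepFactory(lf LockFactory) *stepFactory {
	return &stepFactory{lockFactory: lf}
}

// TaskStep no longer needs a lockFactory parameter; it reads the field.
func (f *stepFactory) TaskStep(plan string) string {
	return fmt.Sprintf("task %q using lock factory %q", plan, f.lockFactory.name)
}

func main() {
	f := NewStepFactory(LockFactory{name: "db-locks"})
	fmt.Println(f.TaskStep("unit"))
}
```

The design upside is that the StepFactory interface stays unchanged while the dependency is still available wherever the struct's methods need it.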

	}

	if step.strategy.ModifiesActiveTasks() {
		if chosenWorker == nil {
Contributor


Should we differentiate the 2 potential conditions:

  • there are no workers in the pool
  • the workers in the pool are busy

In the first case, we might want to bubble the error up to the user.
In the latter case, Concourse can wait until a worker is free to take on the work.

And lastly, is there a max time we would want Concourse to keep trying before giving up and bubbling the error up?

Contributor Author


About the first comment:
That's already taken care of, because pool.allSatisfying already returns an error when the pool is empty, so that behavior is unchanged.

About the timeout: I've thought about it. If we implement it, it should be customizable, and maybe that's the right approach eventually, but for now I really wanted to keep the PR as simple as possible. We can refine the feature incrementally afterward.

Contributor


Thx @aledeganopix4d

Alessandro Degano added 4 commits July 18, 2019 08:58
Add active_tasks counter to db Worker.
Add db migrations to add the new column.
Parametrize max-build-tasks-per-worker from command line argument.
Create TaskStep lock.
Pass lockfactory to task step.
Add locking around task worker choosing.

Signed-off-by: Alessandro Degano <[email protected]>
No new tests yet introduced.

Signed-off-by: Alessandro Degano <[email protected]>
- placement: FewestActiveTasksPlacementStrategy
- task_step: Increase and Decrease active tasks in fakeWorker
- db.Worker: Increase and Decrease active tasks in DB

Signed-off-by: Alessandro Degano <[email protected]>
Rename ModifyActiveTasks() to ModifiesActiveTasks().

Signed-off-by: Alessandro Degano <[email protected]>
@aledegano aledegano force-pushed the active_tasks_lock_task_step branch from e396d1b to ab15b63 Compare July 18, 2019 06:58
@aledegano
Contributor Author

> Hey @aledeganopix4d
> Looks good. We added a few comments.
> There are some scenarios where the decrementCounter might not be executed (i.e. the ATC is restarted). In these scenarios, the worker would have to be retired and registered to reset the counter. It would be worth documenting this so users are aware of these edge cases.

Sure, good idea. Where would the best place be to document this?

@xtreme-sameer-vohra
Contributor

> Hey @aledeganopix4d
> Looks good. We added a few comments.
> There are some scenarios where the decrementCounter might not be executed (i.e. the ATC is restarted). In these scenarios, the worker would have to be retired and registered to reset the counter. It would be worth documenting this so users are aware of these edge cases.
>
> Sure, good idea. Where would the best place be to document this?

Hey @aledeganopix4d
We can document the ContainerPlacementStrategy->limit-active-tasks & MaxActiveTasksPerWorker flags as being experimental in atc/atccmd/command.go

And if you're so kind, you can submit a PR for https://github.com/concourse/docs/blob/master/lit/docs/operation/container-placement.lit. Here you can leverage the \warn tag to make a note that this is experimental

Thanks :)

@aledegano
Contributor Author

> Hey @aledeganopix4d
> Looks good. We added a few comments.
> There are some scenarios where the decrementCounter might not be executed (i.e. the ATC is restarted). In these scenarios, the worker would have to be retired and registered to reset the counter. It would be worth documenting this so users are aware of these edge cases.
>
> Sure, good idea. Where would the best place be to document this?
>
> Hey @aledeganopix4d
> We can document the ContainerPlacementStrategy->limit-active-tasks & MaxActiveTasksPerWorker flags as being experimental in atc/atccmd/command.go
>
> And if you're so kind, you can submit a PR for https://github.com/concourse/docs/blob/master/lit/docs/operation/container-placement.lit. Here you can leverage the \warn tag to make a note that this is experimental
>
> Thanks :)

Hello @xtreme-sameer-vohra,
I've added the Experimental notice in the command line description and then opened a PR in the docs describing this new placement strategy here: concourse/docs#231.
Thanks.

Contributor

@nader-ziada nader-ziada left a comment


Thanks @aledeganopix4d for the PR and the followup changes!

@nader-ziada nader-ziada merged commit aab8f4a into concourse:master Jul 19, 2019
@aledegano aledegano deleted the active_tasks_lock_task_step branch July 19, 2019 14:39
aledegano pushed a commit to Pix4D/concourse that referenced this pull request Jul 22, 2019
This is a follow-up of concourse#4118 which introduced the new placement strategy: `limit-active-tasks`.

To improve the user experience, this PR writes to the UI when the task is waiting for a worker
to be free, warning the user that the system is at full capacity and that the task will wait
for a worker to free up.

Once a worker is available and the task starts, the interface informs the user that the task
started and how long it had been waiting.

Signed-off-by: Alessandro Degano <[email protected]>
aledegano pushed a commit to Pix4D/concourse that referenced this pull request Jul 23, 2019
aledegano pushed a commit to Pix4D/concourse that referenced this pull request Jul 23, 2019
ddadlani pushed a commit that referenced this pull request Jul 23, 2019
Fix conflicts caused by PR #4118 - addition of the `limit-active-tasks`
container placement strategy
@jamieklassen jamieklassen added the release/documented Documentation and release notes have been updated. label Aug 26, 2019
jamieklassen pushed a commit that referenced this pull request Aug 26, 2019
#4118
#4148
#4208
#4277
#4142
#4221
#4293

Signed-off-by: James Thomson <[email protected]>
Co-authored-by: Jamie Klassen <[email protected]>
@gerhard

gerhard commented Sep 3, 2019

Really looking forward to v5.5 shipping with this feature. We've just hit a new wall of failed tests due to CPU contention, even though we have massively over-provisioned workers. It's either baby-sitting our pipelines so that builds go through, or waiting days for builds to go through. 🚢 :shipit:


matthewpereira pushed a commit that referenced this pull request Sep 5, 2019
k8s-ci-robot pushed a commit to helm/charts that referenced this pull request Sep 6, 2019
This commit adds the new parameters that were added to Concourse 5.5.

Here's a breakdown of the new parameters:

- max-active-tasks-per-worker

  > used by the `limit-active-tasks` container placement strategy
  > concourse/concourse#4118

- support for influxdb batching and bigger buffer size for metrics emissions

  > concourse/concourse#3937

- limiting number of max connections in db conn pools

  > concourse/concourse#4232

Signed-off-by: Ciro S. Costa <[email protected]>
Co-authored-by: Zoe Tian <[email protected]>
kengou pushed a commit to kengou/charts that referenced this pull request Sep 18, 2019
ramkumarvs pushed a commit to yugabyte/charts-helm-fork that referenced this pull request Sep 30, 2019
taylorsilva pushed a commit to taylorsilva/concourse-helm that referenced this pull request Oct 2, 2019