Max-build-task-per-workers #4076

Closed

aledegano wants to merge 9 commits into concourse:master from Pix4D:PCI-675-active_tasks_worker_db

Conversation

@aledegano
Contributor

This is a POC to fix #2928.
It relies on the locking mechanism introduced by #3902 (which has been reverted, but I'm hoping it will be re-merged once fixed :D).
Introduce a new column in the workers db table to store the number of active tasks on each worker.
Every time a new task starts on a worker, the counter is incremented (inside the lock!), and when the task exits the counter is decremented (still inside the lock).
When using the fewest-build-containers strategy, then, we select only the workers whose number of active tasks is less than the configured limit (note: the parameter in this draft is just hardcoded to 1).
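The counting scheme described above can be sketched roughly as follows. This is purely illustrative: an in-memory counter guarded by a mutex stands in for the proposed active_tasks column and the DB-level lock, and all names here are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// taskCounter mimics the proposed active_tasks column, keyed by worker name.
// The mutex stands in for the DB lock the PR relies on.
type taskCounter struct {
	mu    sync.Mutex
	tasks map[string]int
}

func newTaskCounter() *taskCounter {
	return &taskCounter{tasks: map[string]int{}}
}

// startTask increments the counter for a worker, but only if it is below
// maxTasks (maxTasks == 0 means no limit). Returns false if the worker is full.
func (c *taskCounter) startTask(worker string, maxTasks int) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	if maxTasks > 0 && c.tasks[worker] >= maxTasks {
		return false
	}
	c.tasks[worker]++
	return true
}

// finishTask decrements the counter when the task exits.
func (c *taskCounter) finishTask(worker string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.tasks[worker]--
}

func main() {
	c := newTaskCounter()
	fmt.Println(c.startTask("worker-1", 1)) // true: first task admitted
	fmt.Println(c.startTask("worker-1", 1)) // false: worker is at the limit
	c.finishTask("worker-1")
	fmt.Println(c.startTask("worker-1", 1)) // true: slot freed again
}
```

The point of doing both the check and the increment inside one critical section is that two concurrently scheduled tasks cannot both observe a free slot and both be admitted.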

This PR still needs some refinement but it works on my small test setup.
I would like some feedback to understand if I'm going in the right direction.

@vito @ddadlani @mhuangpivotal

@aledegano aledegano force-pushed the PCI-675-active_tasks_worker_db branch 2 times, most recently from 888244f to 2ac9a6b Compare June 28, 2019 11:57
Alessandro Degano added 4 commits June 28, 2019 14:00
@aledegano aledegano force-pushed the PCI-675-active_tasks_worker_db branch from 2ac9a6b to 5c43837 Compare June 28, 2019 12:01
Fix existing tests.
Implement maxTasks == 0 that falls back
to the normal fewest-build-containers strategy.

Signed-off-by: Alessandro Degano <[email protected]>
@aledegano aledegano force-pushed the PCI-675-active_tasks_worker_db branch from 5c43837 to aaea9d8 Compare June 28, 2019 12:10
Contributor

@ddadlani ddadlani left a comment

Hi @aledeganopix4d, thanks for working on this. I do have a few questions/concerns though.

I'm a bit concerned that the number of "active tasks" is being tracked separately i.e. we aren't using feedback from the worker to set this number. For example, if you restart the worker, there will be no running containers but the database will still have the old "active tasks" number.

I also didn't realize we would only be tracking tasks, I was under the impression we'd be building off of the build_containers number that is based off of the actual containers placed on the worker. We have the BuildContainersCountPerWorker method in the db_worker_factory that can be used.

Have I misunderstood the use case behind this PR? My understanding was that the workers would get overloaded due to "heavy" containers being placed together on them.

ResourceTypeCheckingInterval time.Duration `long:"resource-type-checking-interval" default:"1m" description:"Interval on which to check for new versions of resource types."`

ContainerPlacementStrategy string `long:"container-placement-strategy" default:"volume-locality" choice:"volume-locality" choice:"random" choice:"fewest-build-containers" description:"Method by which a worker is selected during container placement."`
MaxBuildTasksWorker int `long:"max-build-tasks-worker" default:"0" description:"Maximum allowed number of active build tasks per worker. Has effect only when used with fewest-build-containers placement strategy. 0 means no limit."`
Contributor

Hi @aledeganopix4d, I thought we were going to limit the max build containers per worker? I didn't realize the limit was only for tasks.

If we limit max-build-containers-per-worker, we could leverage the existing build_containers column in the workers table.

Contributor Author

This is better explained in the comments, but just for reference: while creating this PR I realized that build_containers is not a reliable number on which to base the kind of scheduling I'm trying to achieve (as containers stay around long after the worker has completed processing).

Contributor

Hey @aledeganopix4d
To better understand the issue with not being able to use build_containers and tracking only task containers: do you mean resource get & put containers are staying around much longer after a build is completed?

Contributor Author

Hi @xtreme-sameer-vohra, no, that's not the case.
build_containers and task containers are actually the same thing, i.e. the containers where tasks are executed. However, the lifetime of a build_container is longer than the time the worker spends actively running the task: the build_container stays on the worker until it is garbage collected, which happens:

  1. After ~1 minute if the task was successful
  2. After 1 hour if the task errored or failed

While the first case wouldn't be that bad, the second really throws a wrench into any scheduling system that relies on counting build_containers to limit the amount of tasks on a worker.

Contributor

@xtreme-sameer-vohra xtreme-sameer-vohra Jul 3, 2019

Thanks for the clarification @aledeganopix4d . I understand what you mean now.

Internally within Concourse, the hierarchy is;

  • check containers
  • build containers
    • get
    • put
    • task (successful & errored)
    • hook containers (on_failure, on_success, etc.)

Adding workload limiting at the build container level, prior to adding it at the sub-task level, allows Concourse as a system to remain general purpose, while still providing knobs to be optimal in specific cases.

Contributor

Just a quick clarification, build_containers include get, put and task containers, basically just excluding check containers which are assumed to be fairly lightweight.

Contributor Author

Ah! Sorry, that was indeed a misunderstanding on my part!

ResourceTypeCheckingInterval time.Duration `long:"resource-type-checking-interval" default:"1m" description:"Interval on which to check for new versions of resource types."`

ContainerPlacementStrategy string `long:"container-placement-strategy" default:"volume-locality" choice:"volume-locality" choice:"random" choice:"fewest-build-containers" description:"Method by which a worker is selected during container placement."`
MaxBuildTasksWorker int `long:"max-build-tasks-worker" default:"0" description:"Maximum allowed number of active build tasks per worker. Has effect only when used with fewest-build-containers placement strategy. 0 means no limit."`
Contributor

Would it make more sense to have each worker specify its own limit (when connecting to the web nodes?) That way if you have workers with separate capacities, they could utilize their full capacity.

Contributor Author

Yes! That's absolutely the end goal; for the time being I've tried to keep the PR as minimal as possible.

return nil, err
}

if metadata.Type == db.ContainerTypeTask {
Contributor

The containerMetadata.Type field is not guaranteed to be accurate, it exists solely for metrics / output of fly containers. I'm not sure we should rely on it for business logic.

Contributor Author

Oh, okay, thanks for pointing that out.
I'll find a more reliable place to increase the number of active tasks then!

@aledegano
Contributor Author

Hi @aledeganopix4d, thanks for working on this. I do have a few questions/concerns though.

I'm a bit concerned that the number of "active tasks" is being tracked separately i.e. we aren't using feedback from the worker to set this number. For example, if you restart the worker, there will be no running containers but the database will still have the old "active tasks" number.

I also didn't realize we would only be tracking tasks, I was under the impression we'd be building off of the build_containers number that is based off of the actual containers placed on the worker. We have the BuildContainersCountPerWorker method in the db_worker_factory that can be used.

Have I misunderstood the use case behind this PR? My understanding was that the workers would get overloaded due to "heavy" containers being placed together on them.

Hello @ddadlani and first of all thank you very much for your feedback!
I'll try to explain the reasoning behind "max tasks" a bit better:
The main problem with using the number of "build containers" is that they have a very different lifetime than the actual task: think of failed tasks, aborted jobs, errored jobs, etc.
At first I tried that, but I quickly realized that, because of the above, I could even end up in a deadlock!
Imagine this example:

  • A pipeline with a job with multiple in_parallel tasks
  • You set the max build containers to 1 and you have 1 worker (it works for any numbers, however)
  • One task errors
  • All the following tasks are pending until the container executing the errored one gets GC'd (1 hour?)

This is why I decided to introduce the concept of active tasks, so they can have a lifetime independent of the containers.

About the restart of the workers: you are totally right and in that case I should reset the counter.

Alessandro Degano added 3 commits July 1, 2019 14:53
…ent the active tasks count.

Remove check on container metadata.

Signed-off-by: Alessandro Degano <[email protected]>
@aledegano
Contributor Author

@ddadlani following your comments I've updated the PR with the following changes:

  1. Now the task_step also takes care of signaling the worker when a task starts, instead of relying on the container metadata
  2. When a worker starts and registers, the active_tasks counter is always set to 0, resetting it on any restart as well
  3. I've decoupled the placement strategy: there is now a dedicated fewest-active-tasks strategy that chooses the worker with the least amount of running tasks and, if --max-active-tasks-worker is >0, also limits the number of tasks on the workers.

@xtreme-sameer-vohra
Contributor

Hey @aledeganopix4d
Thank you for taking the time to tackle an issue that a lot of users are feeling pain around.

Joining the party late :) but a few thoughts on looking at this PR

  • How might we be able to handle the more generalized case of limiting work per worker ( thinking in terms of one step higher than active tasks per worker to containers per worker ). This allows the feature to be useful for a bunch of different users and use cases. If this makes sense, then you could utilize worker tags to split resource get & put steps on workers with no container limits and have tighter execution limits on task step workers.
  • If a user were to hijack into a failed step and execute anything, this would result in the same degraded performance that Add ATC/worker flags to limit max build containers for workers #2928 is intending to tackle. This is also relevant for fly execute cases.
  • Depending on the environment, a worker might have to re-register frequently due to a network blip or worker rebalance, which means the active tasks counter in the DB would have to be synced with the worker. Setting the counter to 0 on registration may not reflect the reality on a worker which could still be executing a task. This would mean, the worker would also have to track active tasks and report these back to the web/atc on heartbeats.

@ddadlani
Contributor

ddadlani commented Jul 3, 2019

Hi @xtreme-sameer-vohra , @aledeganopix4d ,

Depending on the environment, a worker might have to re-register frequently due to a network blip or worker rebalance, which means the active tasks counter in the DB would have to be synced with the worker. Setting the counter to 0 on registration may not reflect the reality on a worker which could still be executing a task. This would mean, the worker would also have to track active tasks and report these back to the web/atc on heartbeats.

This is one of the reasons I preferred the use of build containers for the scheduling decision as opposed to tasks, because containers are already heartbeated/regularly reported by the worker.

I'm also a bit worried that tracking active_tasks isn't general enough. There are users that get and put very large resources, for example, and their bottleneck might not be the active tasks.

At first I had tried that but I quickly realized that, because of the above, I could even end up in deadlock!
Imagine this example:

A pipeline with a job with multiple in_parallel tasks
You set the max build containers to 1 and you have 1 worker (it works for any numbers however)
One task errors
All the following tasks are pending until the container executing the errored one gets GC (1 hour?)

This is a very valid concern, but it brings up issues with this kind of scheduling optimization in general. We don't account for the state of the containers on a worker (e.g. creating, destroying, created etc) but we can't because one can still hit Garden's container limit of 250 regardless of what state the containers on a worker are in. If we don't actually measure the actual resource usage of each container, we're just making guesses as to which kinds of containers are most likely to bring down a worker. I can see the argument that if we have max-active-tasks-per-worker we should also have max-active-gets and puts too.

@aledegano
Contributor Author

Hi @ddadlani and again thank you for taking the time to engage in the discussion.

Hi @xtreme-sameer-vohra , @aledeganopix4d ,

Depending on the environment, a worker might have to re-register frequently due to a network blip or worker rebalance, which means the active tasks counter in the DB would have to be synced with the worker. Setting the counter to 0 on registration may not reflect the reality on a worker which could still be executing a task. This would mean, the worker would also have to track active tasks and report these back to the web/atc on heartbeats.

This is one of the reasons I preferred the use of build containers for the scheduling decision as opposed to tasks, because containers are already heartbeated/regularly reported by the worker.

I'm also a bit worried that tracking active_tasks isn't general enough. There are users that get and put very large resources, for example, and their bottleneck might not be the active tasks.

This is a totally valid point. I would argue, though, that the worker resources typically engaged in put/get and in tasks are different: in the former I expect mostly network and I/O, while in the latter I expect heavier computation (that's of course assuming the task itself isn't also doing network and I/O because the pipeline is used in the idiomatic way, but we'll never be sure...).
This means there would be two different queues, one for put/get and one for tasks.

At first I had tried that but I quickly realized that, because of the above, I could even end up in deadlock!
Imagine this example:
A pipeline with a job with multiple in_parallel tasks
You set the max build containers to 1 and you have 1 worker (it works for any numbers however)
One task errors
All the following tasks are pending until the container executing the errored one gets GC (1 hour?)

This is a very valid concern, but it brings up issues with this kind of scheduling optimization in general. We don't account for the state of the containers on a worker (e.g. creating, destroying, created etc) but we can't because one can still hit Garden's container limit of 250 regardless of what state the containers on a worker are in. If we don't actually measure the actual resource usage of each container, we're just making guesses as to which kinds of containers are most likely to bring down a worker. I can see the argument that if we have max-active-tasks-per-worker we should also have max-active-gets and puts too.

I understand your point, and I think this discussion goes well beyond the scope of this PR: in my experience as a CI operator (on multiple systems, not only Concourse), the observation is that if you wait to see which resources a running task is using, then it's already too late for scheduling, as you would need to rebalance the system in flight (e.g. moving containers to workers with a lighter load).
In my understanding these kinds of operations are possible in complex container orchestrators but seem way beyond the scope of Concourse.
Thus, the only strategy a CI system can employ is to simply assume in advance that a given task will use a set amount of resources and then schedule accordingly.

I think it's totally reasonable to introduce max-active-gets and max-active-puts (even though they could probably share the same limit, as I'd expect the resources employed to be very similar) as additional optional parameters to this new placement strategy.

@aledegano
Contributor Author

aledegano commented Jul 4, 2019

Hi @xtreme-sameer-vohra thank you for taking the time to leave some feedback!

Hey @aledeganopix4d
Thank you for taking the time to tackle an issue that lot of users are feeling pain around.

Joining the party late :) but a few thoughts on looking at this PR

  • How might we be able to handle the more generalized case of limiting work per worker ( thinking in terms of one step higher than active tasks per worker to containers per worker ). This allows the feature to be useful for a bunch of different users and use cases. If this makes sense, then you could utilize worker tags to split resource get & put steps on workers with no container limits and have tighter execution limits on task step workers.

I was thinking that maybe we could have max-get-put-tasks-worker optional arguments (one, or both) and the corresponding counters, if that's really a big use case.

In its current state the PR does not take this into account, true, but I think it would be quite easy to increment the active tasks on the worker when a container on it gets hijacked. Similarly when fly execute is running something on the worker.

  • Depending on the environment, a worker might have to re-register frequently due to a network blip or worker rebalance, which means the active tasks counter in the DB would have to be synced with the worker. Setting the counter to 0 on registration may not reflect the reality on a worker which could still be executing a task. This would mean, the worker would also have to track active tasks and report these back to the web/atc on heartbeats.

You're right, I didn't consider this. The problem is that it's hard to know:

  1. For the worker if a task is running (or is completed)
    or
  2. For the ATC if the worker is connecting for the first time or reconnecting after a network reset

I'm not sure how this could be addressed.

@ddadlani
Contributor

ddadlani commented Jul 5, 2019

Hi @aledeganopix4d,

The problem is that it's hard to know:

For the worker if a task is running (or is completed)
or
For the ATC if the worker is connecting for the first time or reconnecting after a network reset
I'm not sure how this could be addressed.

I think you hit the nail on the head here. I think the most useful distinction to make would be "active" build containers vs "inactive", read: destroying, initializing, errored etc. This would be the most general answer to the race problem that occurs if we naively limit the number of build containers on the worker. The problem is that there is no direct correlation in the runtime between a step's state and its container's state. To me, this is a problem that requires more design and that makes this issue not as easy to resolve as we originally thought.

I'm hesitant to have separate limits for active tasks, active gets and active puts because:

  1. we need to have some kind of correlation between step state and container state, which we don't have
  2. the numbers of active step containers would need to be regularly updated using information from the worker because they can so easily fall out of sync in so many different ways.

This leads me to think that this is not a straightforward stop-gap solution that can easily be implemented before the scheduling queue can be implemented. Does that make sense? What are your thoughts on it?

@aledegano
Contributor Author

Hi @aledeganopix4d,

The problem is that it's hard to know:
For the worker if a task is running (or is completed)
or
For the ATC if the worker is connecting for the first time or reconnecting after a network reset
I'm not sure how this could be addressed.

I think you hit the nail on the head here. I think the most useful distinction to make would be "active" build containers vs "inactive", read: destroying, initializing, errored etc. This would be the most general answer to the race problem that occurs if we naively limit the number of build containers on the worker. The problem is that there is no direct correlation in the runtime between a step's state and its container's state. To me, this is a problem that requires more design and that makes this issue not as easy to resolve as we originally thought.

I agree, and I think it's relatively easy to expand on the idea I've proposed here: when we run steps (whether task, put, or get) we can mark the db Container as "running", and then as "completed" when the step completes (regardless of how it completes).

I'm hesitant to have separate limits for active tasks, active gets and active puts because:

  1. we need to have some kind of correlation between step state and container state, which we don't have
  2. the numbers of active step containers would need to be regularly updated using information from the worker because they can so easily fall out of sync in so many different ways.

Agreed; I'm thinking that what I wrote above could address these two points.

This leads me to think that this is not a straightforward stop-gap solution that can easily be implemented before the scheduling queue can be implemented. Does that make sense? What are your thoughts on it?

I understand that, however we tackle this task, it will require more workarounds. Moreover, it also depends on how we fix #3301, as whatever serializing mechanism we use there will be used here too (btw, I've given fixing that a shot here: #4108).
However, while I'm rooting and cheering to have a scheduling queue ASAP, I believe that will take some time, and in the meanwhile we could have "serviceable" code that prevents workers from being overloaded. It would be a huge win for my team and, hopefully, for others too! :D

Contributor

@ddadlani ddadlani left a comment

Hey @aledeganopix4d,

Thanks for working on this feature! I still have a few reservations around ways that the number of active tasks may be inconsistent, but I think based on our discussion and with the changes I have suggested, it should be good for a v1 of this feature. We can document this feature as experimental and allow users to opt in to minimize impact on the general users.

Also regarding PR #4108 that adds in the container creating lock. Could you add a lock in this PR that only locks around running a task step? I understand that you originally wrote #4076 with the assumption that the ContainerCreatingLock would be there, but I don't think we need to lock around all kinds of container creation for this PR. We could add in a TaskStepLock or something similar to ensure that active_tasks for each worker is accurate. This new lock could also be opt-in, i.e. we don't try to serialize the task step if the user has not set max-active-tasks-worker.

I will leave my thoughts on #4108 on that PR.

Last request, can we also get some more test coverage for this feature?

Let me know if you have any questions!

workerVersion = &atcWorker.Version
}

activeTasks := 0 // When a worker starts up, by definition it has no active tasks.
Contributor

When calling saveWorker, it is possible for the worker to already exist in the DB. This can happen in a few ways:

  1. when there is a network blip and the worker is marked as stalled and then re-registers
  2. when the worker rebalances to another ATC after some time.

In these cases, we would expect that active_tasks should remain the same as they were previously. active_tasks should only be set to 0 for new/recreated workers.
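What the reviewer is asking for could look roughly like this, with a hypothetical in-memory registry standing in for the real saveWorker/DB upsert:

```go
package main

import "fmt"

// registry is a hypothetical stand-in for the workers table.
type registry struct {
	activeTasks map[string]int
}

// saveWorker registers a worker. For a brand-new (or recreated) worker the
// counter starts at 0; for a worker that already exists -- e.g. one that
// stalled and re-registered, or rebalanced to another ATC -- the previous
// active_tasks value is preserved, as the reviewer suggests.
func (r *registry) saveWorker(name string, recreated bool) {
	if _, exists := r.activeTasks[name]; !exists || recreated {
		r.activeTasks[name] = 0
	}
	// otherwise: keep the existing counter untouched
}

func main() {
	r := &registry{activeTasks: map[string]int{}}
	r.saveWorker("w1", false) // new worker: counter starts at 0
	r.activeTasks["w1"] = 3   // three tasks start running
	r.saveWorker("w1", false) // re-registration after a blip: counter kept
	fmt.Println(r.activeTasks["w1"]) // 3
	r.saveWorker("w1", true) // recreated worker: counter reset
	fmt.Println(r.activeTasks["w1"]) // 0
}
```

In the real implementation the distinction between "re-registered" and "recreated" would have to come from the DB upsert path, not a boolean flag; the flag here only makes the two cases explicit.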


go func() {
processStatus, processErr = process.Wait()
logger.Info("Decreasing active tasks on worker.")
Contributor

We would also want to decrement the number of active tasks in here

If the task was scheduled on a worker, but then the ATC died and another ATC picked up the task, it would go through this flow. If the task has already completed on the worker, we want to decrement the number of tasks to reflect reality.


ContainerPlacementStrategy string `long:"container-placement-strategy" default:"volume-locality" choice:"volume-locality" choice:"random" choice:"fewest-build-containers" description:"Method by which a worker is selected during container placement."`
ContainerPlacementStrategy string `long:"container-placement-strategy" default:"volume-locality" choice:"volume-locality" choice:"random" choice:"fewest-build-containers" choice:"fewest-active-tasks" description:"Method by which a worker is selected during container placement."`
MaxActiveTasksWorker int `long:"max-active-tasks-worker" default:"0" description:"Maximum allowed number of active build tasks per worker. Has effect only when used with fewest-active-tasks placement strategy. 0 means no limit."`
Contributor

minor nitpick: can we call this MaxActiveTasksPerWorker? :) I don't think the command-line flag needs to change though.

@aledegano
Contributor Author

Hey @aledeganopix4d,

Thanks for working on this feature! I still have a few reservations around ways that the number of active tasks may be inconsistent, but I think based on our discussion and with the changes I have suggested, it should be good for a v1 of this feature. We can document this feature as experimental and allow users to opt in to minimize impact on the general users.

Also regarding PR #4108 that adds in the container creating lock. Could you add a lock in this PR that only locks around running a task step? I understand that you originally wrote #4076 with the assumption that the ContainerCreatingLock would be there, but I don't think we need to lock around all kinds of container creation for this PR. We could add in a TaskStepLock or something similar to ensure that active_tasks for each worker is accurate. This new lock could also be opt-in, i.e. we don't try to serialize the task step if the user has not set max-active-tasks-worker.

I will leave my thoughts on #4108 on that PR.

Last request, can we also get some more test coverage for this feature?

Let me know if you have any questions!

I agree with all your comments and requests.
I'll start working on this ASAP!
Thank you for all your feedback.

@aledegano
Contributor Author

Superseded by #4118

@aledegano aledegano closed this Jul 16, 2019
@aledegano aledegano deleted the PCI-675-active_tasks_worker_db branch July 16, 2019 12:52

Development

Successfully merging this pull request may close these issues.

Add ATC/worker flags to limit max build containers for workers

3 participants