Max-build-task-per-workers #4076

Closed

aledegano wants to merge 9 commits into concourse:master from Pix4D:PCI-675-active_tasks_worker_db

Conversation

@aledegano
Contributor

This is a POC to fix #2928.
It relies on the locking mechanism introduced by #3902 (which has been reverted, but I'm hoping it will be re-merged once fixed :D).
Introduce a new column in the workers db table to store the number of active tasks on each worker.
Every time a new task starts on a worker, the counter is incremented (inside the lock!), and when the task exits the counter is decremented (still inside the lock).
When using the fewest-build-containers strategy, then, we select only the workers whose number of active tasks is less than the configured limit (note: the parameter in this draft is just hardcoded to 1).
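The counting scheme described above can be sketched roughly as follows. This is purely illustrative: an in-memory counter guarded by a mutex stands in for the proposed active_tasks column and the DB-level lock, and all names here are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// taskCounter mimics the proposed active_tasks column, keyed by worker name.
// The mutex stands in for the DB lock the PR relies on.
type taskCounter struct {
	mu    sync.Mutex
	tasks map[string]int
}

func newTaskCounter() *taskCounter {
	return &taskCounter{tasks: map[string]int{}}
}

// startTask increments the counter for a worker, but only if it is below
// maxTasks (maxTasks == 0 means no limit). Returns false if the worker is full.
func (c *taskCounter) startTask(worker string, maxTasks int) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	if maxTasks > 0 && c.tasks[worker] >= maxTasks {
		return false
	}
	c.tasks[worker]++
	return true
}

// finishTask decrements the counter when the task exits.
func (c *taskCounter) finishTask(worker string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.tasks[worker]--
}

func main() {
	c := newTaskCounter()
	fmt.Println(c.startTask("worker-1", 1)) // true: first task admitted
	fmt.Println(c.startTask("worker-1", 1)) // false: worker is at the limit
	c.finishTask("worker-1")
	fmt.Println(c.startTask("worker-1", 1)) // true: slot freed again
}
```

The point of doing both the check and the increment inside one critical section is that two concurrently scheduled tasks cannot both observe a free slot and both be admitted.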

This PR still needs some refinement but it works on my small test setup.
I would like some feedback to understand if I'm going in the right direction.

@vito @ddadlani @mhuangpivotal

@aledegano aledegano force-pushed the PCI-675-active_tasks_worker_db branch 2 times, most recently from 888244f to 2ac9a6b Compare June 28, 2019 11:57
Alessandro Degano added 4 commits June 28, 2019 14:00
@aledegano aledegano force-pushed the PCI-675-active_tasks_worker_db branch from 2ac9a6b to 5c43837 Compare June 28, 2019 12:01
Fix existing tests.
Implement maxTasks == 0 that falls back
to the normal fewest-build-containers strategy.

Signed-off-by: Alessandro Degano <[email protected]>
@aledegano aledegano force-pushed the PCI-675-active_tasks_worker_db branch from 5c43837 to aaea9d8 Compare June 28, 2019 12:10
Contributor

@ddadlani ddadlani left a comment

Hi @aledeganopix4d, thanks for working on this. I do have a few questions/concerns though.

I'm a bit concerned that the number of "active tasks" is being tracked separately i.e. we aren't using feedback from the worker to set this number. For example, if you restart the worker, there will be no running containers but the database will still have the old "active tasks" number.

I also didn't realize we would only be tracking tasks, I was under the impression we'd be building off of the build_containers number that is based off of the actual containers placed on the worker. We have the BuildContainersCountPerWorker method in the db_worker_factory that can be used.

Have I misunderstood the use case behind this PR? My understanding was that the workers would get overloaded due to "heavy" containers being placed together on them.

ResourceTypeCheckingInterval time.Duration `long:"resource-type-checking-interval" default:"1m" description:"Interval on which to check for new versions of resource types."`

ContainerPlacementStrategy string `long:"container-placement-strategy" default:"volume-locality" choice:"volume-locality" choice:"random" choice:"fewest-build-containers" description:"Method by which a worker is selected during container placement."`
MaxBuildTasksWorker int `long:"max-build-tasks-worker" default:"0" description:"Maximum allowed number of active build tasks per worker. Has effect only when used with fewest-build-containers placement strategy. 0 means no limit."`
Contributor

Hi @aledeganopix4d, I thought we were going to limit the max build containers per worker? I didn't realize the limit was only for tasks.

If we limit max-build-containers-per-worker, we could leverage the existing build_containers column in the workers table.

Contributor Author

This is better explained in the comments, but just for reference: while creating this PR I realized that build_containers is not a reliable number on which to base the kind of scheduling I'm trying to achieve (as containers stay around long after the worker has completed processing).

Contributor

Hey @aledeganopix4d
To better understand the issue with not being able to use build_containers and tracking only task containers: do you mean resource get & put containers are staying around much longer after a build is completed?

Contributor Author

Hi @xtreme-sameer-vohra, no, that's not the case.
build_containers and task containers are actually the same thing, i.e. the containers where tasks are executed. However, the lifetime of a build_container is longer than the time the worker spends actively running the task: the build_container stays on the worker until it is garbage collected, which happens:

  1. After ~1 minute if the task was successful
  2. After 1 hour if the task errored or failed

While the first case wouldn't be that bad, the second really throws a wrench into any scheduling system that relies on counting build_containers to limit the amount of tasks on a worker.

Contributor

@xtreme-sameer-vohra xtreme-sameer-vohra Jul 3, 2019

Thanks for the clarification @aledeganopix4d . I understand what you mean now.

Internally within Concourse, the hierarchy is;

  • check containers
  • build containers
    • get
    • put
    • task (successful & errored)
    • hook containers (on_failure, on_success, etc.)

Adding workload limiting at the build container level, prior to adding it at the sub-task level, allows Concourse as a system to remain general purpose, while still providing knobs to be optimal in specific cases.

Contributor

Just a quick clarification, build_containers include get, put and task containers, basically just excluding check containers which are assumed to be fairly lightweight.

Contributor Author

Ah! Sorry, that was indeed a misunderstanding on my part!

ResourceTypeCheckingInterval time.Duration `long:"resource-type-checking-interval" default:"1m" description:"Interval on which to check for new versions of resource types."`

ContainerPlacementStrategy string `long:"container-placement-strategy" default:"volume-locality" choice:"volume-locality" choice:"random" choice:"fewest-build-containers" description:"Method by which a worker is selected during container placement."`
MaxBuildTasksWorker int `long:"max-build-tasks-worker" default:"0" description:"Maximum allowed number of active build tasks per worker. Has effect only when used with fewest-build-containers placement strategy. 0 means no limit."`
Contributor

Would it make more sense to have each worker specify its own limit (when connecting to the web nodes?) That way if you have workers with separate capacities, they could utilize their full capacity.

Contributor Author

Yes! That's absolutely the end goal; for the time being I've tried to keep the PR as minimal as possible.

return nil, err
}

if metadata.Type == db.ContainerTypeTask {
Contributor

The containerMetadata.Type field is not guaranteed to be accurate, it exists solely for metrics / output of fly containers. I'm not sure we should rely on it for business logic.

Contributor Author

Oh, okay, thanks for pointing that out.
I'll find a more reliable place to increase the number of active tasks then!

@aledegano
Contributor Author

Hi @aledeganopix4d, thanks for working on this. I do have a few questions/concerns though.

I'm a bit concerned that the number of "active tasks" is being tracked separately i.e. we aren't using feedback from the worker to set this number. For example, if you restart the worker, there will be no running containers but the database will still have the old "active tasks" number.

I also didn't realize we would only be tracking tasks, I was under the impression we'd be building off of the build_containers number that is based off of the actual containers placed on the worker. We have the BuildContainersCountPerWorker method in the db_worker_factory that can be used.

Have I misunderstood the use case behind this PR? My understanding was that the workers would get overloaded due to "heavy" containers being placed together on them.

Hello @ddadlani and first of all thank you very much for your feedback!
I'll try to explain the reasoning behind "max tasks" a bit better:
The main problem with using the number of "build containers" is that they have a very different lifetime than the actual task: think of failed tasks, aborted jobs, errored jobs, etc.
At first I tried that, but I quickly realized that, because of the above, I could even end up in a deadlock!
Imagine this example:

  • A pipeline with a job with multiple in_parallel tasks
  • You set the max build containers to 1 and you have 1 worker (it works for any numbers, however)
  • One task errors
  • All the following tasks are pending until the container executing the errored one gets GC'd (1 hour?)

This is why I decided to introduce the concept of active tasks, so they can have a lifetime independent of the containers.

About the restart of the workers: you are totally right and in that case I should reset the counter.

Alessandro Degano added 3 commits July 1, 2019 14:53
…ent the active tasks count.

Remove check on container metadata.

Signed-off-by: Alessandro Degano <[email protected]>
@aledegano
Contributor Author

@ddadlani following your comments I've updated the PR with the following changes:

  1. Now the task_step also takes care of signaling the worker when a task starts, instead of relying on the container metadata
  2. When a worker starts and registers, the active_tasks counter is always set to 0, resetting it on any restart as well
  3. I've decoupled the placement strategy: there is now a dedicated fewest-active-tasks strategy that chooses the worker with the least amount of running tasks and, if --max-active-tasks-worker is >0, also limits the number of tasks on the workers.

@xtreme-sameer-vohra
Contributor

Hey @aledeganopix4d
Thank you for taking the time to tackle an issue that a lot of users are feeling pain around.

Joining the party late :) but a few thoughts on looking at this PR

  • How might we be able to handle the more generalized case of limiting work per worker ( thinking in terms of one step higher than active tasks per worker to containers per worker ). This allows the feature to be useful for a bunch of different users and use cases. If this makes sense, then you could utilize worker tags to split resource get & put steps on workers with no container limits and have tighter execution limits on task step workers.
  • If a user were to hijack into a failed step and execute anything, this would result in the same degraded performance that Add ATC/worker flags to limit max build containers for workers #2928 is intending to tackle. This is also relevant for fly execute cases.
  • Depending on the environment, a worker might have to re-register frequently due to a network blip or worker rebalance, which means the active tasks counter in the DB would have to be synced with the worker. Setting the counter to 0 on registration may not reflect the reality on a worker which could still be executing a task. This would mean, the worker would also have to track active tasks and report these back to the web/atc on heartbeats.

@ddadlani
Contributor

ddadlani commented Jul 3, 2019

Hi @xtreme-sameer-vohra , @aledeganopix4d ,

Depending on the environment, a worker might have to re-register frequently due to a network blip or worker rebalance, which means the active tasks counter in the DB would have to be synced with the worker. Setting the counter to 0 on registration may not reflect the reality on a worker which could still be executing a task. This would mean, the worker would also have to track active tasks and report these back to the web/atc on heartbeats.

This is one of the reasons I preferred the use of build containers for the scheduling decision as opposed to tasks, because containers are already heartbeated/regularly reported by the worker.

I'm also a bit worried that tracking active_tasks isn't general enough. There are users that get and put very large resources, for example, and their bottleneck might not be the active tasks.

At first I had tried that but I quickly realized that, because of the above, I could even end up in deadlock!
Imagine this example:

A pipeline with a job with multiple in_parallel tasks
You set the max build containers to 1 and you have 1 worker (it works for any numbers however)
One task errors
All the following tasks are pending until the container executing the errored one gets GC (1 hour?)

This is a very valid concern, but it brings up issues with this kind of scheduling optimization in general. We don't account for the state of the containers on a worker (e.g. creating, destroying, created etc) but we can't because one can still hit Garden's container limit of 250 regardless of what state the containers on a worker are in. If we don't actually measure the actual resource usage of each container, we're just making guesses as to which kinds of containers are most likely to bring down a worker. I can see the argument that if we have max-active-tasks-per-worker we should also have max-active-gets and puts too.

@aledegano
Contributor Author

Hi @ddadlani and again thank you for taking the time to engage in the discussion.

Hi @xtreme-sameer-vohra , @aledeganopix4d ,

Depending on the environment, a worker might have to re-register frequently due to a network blip or worker rebalance, which means the active tasks counter in the DB would have to be synced with the worker. Setting the counter to 0 on registration may not reflect the reality on a worker which could still be executing a task. This would mean, the worker would also have to track active tasks and report these back to the web/atc on heartbeats.

This is one of the reasons I preferred the use of build containers for the scheduling decision as opposed to tasks, because containers are already heartbeated/regularly reported by the worker.

I'm also a bit worried that tracking active_tasks isn't general enough. There are users that get and put very large resources, for example, and their bottleneck might not be the active tasks.

This is a totally valid point. I would argue, though, that the worker resources typically engaged in put/get and in tasks are different: in the former I expect mostly network and I/O, while in the latter I expect heavier computation (that's of course assuming the task itself isn't also doing network and I/O because the pipeline is used in the idiomatic way, but we'll never be sure...).
This means there would be two different queues, one for put/get and one for tasks.

At first I had tried that but I quickly realized that, because of the above, I could even end up in deadlock!
Imagine this example:
A pipeline with a job with multiple in_parallel tasks
You set the max build containers to 1 and you have 1 worker (it works for any numbers however)
One task errors
All the following tasks are pending until the container executing the errored one gets GC (1 hour?)

This is a very valid concern, but it brings up issues with this kind of scheduling optimization in general. We don't account for the state of the containers on a worker (e.g. creating, destroying, created etc) but we can't because one can still hit Garden's container limit of 250 regardless of what state the containers on a worker are in. If we don't actually measure the actual resource usage of each container, we're just making guesses as to which kinds of containers are most likely to bring down a worker. I can see the argument that if we have max-active-tasks-per-worker we should also have max-active-gets and puts too.

I understand your point, and I think this discussion goes well beyond the scope of this PR: in my experience as a CI operator (on multiple systems, not only Concourse), the observation is that if you wait to see which resources a running task is using, then it's already too late for scheduling, as you would need to rebalance the system in flight (e.g. moving containers to workers with a lighter load).
In my understanding these kinds of operations are possible in complex container orchestrators but seem way beyond the scope of Concourse.
Thus, the only strategy a CI system can employ is to simply assume in advance that a given task will use a set amount of resources and then schedule accordingly.

I think it's totally reasonable to introduce max-active-gets and max-active-puts (even though they could probably share the same limit, as I'd expect the resources employed to be very similar) as additional optional parameters to this new placement strategy.

@aledegano
Contributor Author

aledegano commented Jul 4, 2019

Hi @xtreme-sameer-vohra thank you for taking the time to leave some feedback!

Hey @aledeganopix4d
Thank you for taking the time to tackle an issue that lot of users are feeling pain around.

Joining the party late :) but a few thoughts on looking at this PR

  • How might we be able to handle the more generalized case of limiting work per worker ( thinking in terms of one step higher than active tasks per worker to containers per worker ). This allows the feature to be useful for a bunch of different users and use cases. If this makes sense, then you could utilize worker tags to split resource get & put steps on workers with no container limits and have tighter execution limits on task step workers.

I was thinking that maybe we could have max-get-put-tasks-worker optional arguments (one, or both) and the corresponding counters, if that's really a big use case.

In its current state the PR does not take this into account, true, but I think it would be quite easy to increment the active tasks on the worker when a container on it gets hijacked. Similarly when fly execute is running something on the worker.

  • Depending on the environment, a worker might have to re-register frequently due to a network blip or worker rebalance, which means the active tasks counter in the DB would have to be synced with the worker. Setting the counter to 0 on registration may not reflect the reality on a worker which could still be executing a task. This would mean, the worker would also have to track active tasks and report these back to the web/atc on heartbeats.

You're right, I didn't consider this. The problem is that it's hard to know:

  1. For the worker if a task is running (or is completed)
    or
  2. For the ATC if the worker is connecting for the first time or reconnecting after a network reset

I'm not sure how this could be addressed.

@ddadlani
Contributor

ddadlani commented Jul 5, 2019

Hi @aledeganopix4d,

The problem is that it's hard to know:

For the worker if a task is running (or is completed)
or
For the ATC if the worker is connecting for the first time or reconnecting after a network reset
I'm not sure how this could be addressed.

I think you hit the nail on the head here. I think the most useful distinction to make would be "active" build containers vs "inactive", read: destroying, initializing, errored etc. This would be the most general answer to the race problem that occurs if we naively limit the number of build containers on the worker. The problem is that there is no direct correlation in the runtime between a step's state and its container's state. To me, this is a problem that requires more design and that makes this issue not as easy to resolve as we originally thought.

I'm hesitant to have separate limits for active tasks, active gets and active puts because:

  1. we need to have some kind of correlation between step state and container state, which we don't have
  2. the numbers of active step containers would need to be regularly updated using information from the worker because they can so easily fall out of sync in so many different ways.

This leads me to think that this is not a straightforward stop-gap solution that can easily be implemented before the scheduling queue can be implemented. Does that make sense? What are your thoughts on it?

@aledegano
Contributor Author

Hi @aledeganopix4d,

The problem is that it's hard to know:
For the worker if a task is running (or is completed)
or
For the ATC if the worker is connecting for the first time or reconnecting after a network reset
I'm not sure how this could be addressed.

I think you hit the nail on the head here. I think the most useful distinction to make would be "active" build containers vs "inactive", read: destroying, initializing, errored etc. This would be the most general answer to the race problem that occurs if we naively limit the number of build containers on the worker. The problem is that there is no direct correlation in the runtime between a step's state and its container's state. To me, this is a problem that requires more design and that makes this issue not as easy to resolve as we originally thought.

I agree, and I think it's relatively easy to expand on the idea I've proposed here: when we run steps (whether task, put, or get) we can mark the db Container as "running", and then as "completed" when the step completes (regardless of how it completes).

I'm hesitant to have separate limits for active tasks, active gets and active puts because:

  1. we need to have some kind of correlation between step state and container state, which we don't have
  2. the numbers of active step containers would need to be regularly updated using information from the worker because they can so easily fall out of sync in so many different ways.

Agreed; I'm thinking that what I wrote above could address these two points.

This leads me to think that this is not a straightforward stop-gap solution that can easily be implemented before the scheduling queue can be implemented. Does that make sense? What are your thoughts on it?

I understand that, however we tackle this task, it will require more workarounds. Moreover, it also depends on how we fix #3301, as whatever serializing mechanism we use there will be used here too (btw, I've given fixing that a shot here: #4108).
However, while I'm rooting and cheering to have a scheduling queue ASAP, I believe that will take some time, and in the meanwhile we could have "serviceable" code that prevents workers from being overloaded. It would be a huge win for my team and, hopefully, for others too! :D

Contributor

@ddadlani ddadlani left a comment

Hey @aledeganopix4d,

Thanks for working on this feature! I still have a few reservations around ways that the number of active tasks may be inconsistent, but I think based on our discussion and with the changes I have suggested, it should be good for a v1 of this feature. We can document this feature as experimental and allow users to opt in to minimize impact on the general users.

Also regarding PR #4108 that adds in the container creating lock. Could you add a lock in this PR that only locks around running a task step? I understand that you originally wrote #4076 with the assumption that the ContainerCreatingLock would be there, but I don't think we need to lock around all kinds of container creation for this PR. We could add in a TaskStepLock or something similar to ensure that active_tasks for each worker is accurate. This new lock could also be opt-in, i.e. we don't try to serialize the task step if the user has not set max-active-tasks-worker.

I will leave my thoughts on #4108 on that PR.

Last request, can we also get some more test coverage for this feature?

Let me know if you have any questions!

workerVersion = &atcWorker.Version
}

activeTasks := 0 // When a worker starts up, by definition it has no active tasks.
Contributor

When calling saveWorker, it is possible for the worker to already exist in the DB. This can happen in a few ways:

  1. when there is a network blip and the worker is marked as stalled and then re-registers
  2. when the worker rebalances to another ATC after some time.

In these cases, we would expect that active_tasks should remain the same as they were previously. active_tasks should only be set to 0 for new/recreated workers.
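What the reviewer is asking for could look roughly like this, with a hypothetical in-memory registry standing in for the real saveWorker/DB upsert:

```go
package main

import "fmt"

// registry is a hypothetical stand-in for the workers table.
type registry struct {
	activeTasks map[string]int
}

// saveWorker registers a worker. For a brand-new (or recreated) worker the
// counter starts at 0; for a worker that already exists -- e.g. one that
// stalled and re-registered, or rebalanced to another ATC -- the previous
// active_tasks value is preserved, as the reviewer suggests.
func (r *registry) saveWorker(name string, recreated bool) {
	if _, exists := r.activeTasks[name]; !exists || recreated {
		r.activeTasks[name] = 0
	}
	// otherwise: keep the existing counter untouched
}

func main() {
	r := &registry{activeTasks: map[string]int{}}
	r.saveWorker("w1", false) // new worker: counter starts at 0
	r.activeTasks["w1"] = 3   // three tasks start running
	r.saveWorker("w1", false) // re-registration after a blip: counter kept
	fmt.Println(r.activeTasks["w1"]) // 3
	r.saveWorker("w1", true) // recreated worker: counter reset
	fmt.Println(r.activeTasks["w1"]) // 0
}
```

In the real implementation the distinction between "re-registered" and "recreated" would have to come from the DB upsert path, not a boolean flag; the flag here only makes the two cases explicit.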


go func() {
processStatus, processErr = process.Wait()
logger.Info("Decreasing active tasks on worker.")
Contributor

We would also want to decrement the number of active tasks in here

If the task was scheduled on a worker, but then the ATC died and another ATC picked up the task, it would go through this flow. If the task has already completed on the worker, we want to decrement the number of tasks to reflect reality.


ContainerPlacementStrategy string `long:"container-placement-strategy" default:"volume-locality" choice:"volume-locality" choice:"random" choice:"fewest-build-containers" description:"Method by which a worker is selected during container placement."`
ContainerPlacementStrategy string `long:"container-placement-strategy" default:"volume-locality" choice:"volume-locality" choice:"random" choice:"fewest-build-containers" choice:"fewest-active-tasks" description:"Method by which a worker is selected during container placement."`
MaxActiveTasksWorker int `long:"max-active-tasks-worker" default:"0" description:"Maximum allowed number of active build tasks per worker. Has effect only when used with fewest-active-tasks placement strategy. 0 means no limit."`
Contributor

minor nitpick: can we call this MaxActiveTasksPerWorker? :) I don't think the command-line flag needs to change though.

@aledegano
Contributor Author

Hey @aledeganopix4d,

Thanks for working on this feature! I still have a few reservations around ways that the number of active tasks may be inconsistent, but I think based on our discussion and with the changes I have suggested, it should be good for a v1 of this feature. We can document this feature as experimental and allow users to opt in to minimize impact on the general users.

Also regarding PR #4108 that adds in the container creating lock. Could you add a lock in this PR that only locks around running a task step? I understand that you originally wrote #4076 with the assumption that the ContainerCreatingLock would be there, but I don't think we need to lock around all kinds of container creation for this PR. We could add in a TaskStepLock or something similar to ensure that active_tasks for each worker is accurate. This new lock could also be opt-in, i.e. we don't try to serialize the task step if the user has not set max-active-tasks-worker.

I will leave my thoughts on #4108 on that PR.

Last request, can we also get some more test coverage for this feature?

Let me know if you have any questions!

I agree with all your comments and requests.
I'll start working on this ASAP!
Thank you for all your feedback.

@aledegano
Contributor Author

Superseded by #4118

@aledegano aledegano closed this Jul 16, 2019
@aledegano aledegano deleted the PCI-675-active_tasks_worker_db branch July 16, 2019 12:52

Development

Successfully merging this pull request may close these issues.

Add ATC/worker flags to limit max build containers for workers

3 participants